Analyzing PyPI package downloads

This section covers how to use the PyPI package dataset to learn more about downloads of a package (or packages) hosted on PyPI. For example, you can use it to discover the distribution of Python versions used to download a package.

Background

PyPI does not display download statistics because they are difficult to collect and display accurately. Reasons for this are included in the announcement email:

There are numerous reasons for [download counts] removal/deprecation some of which are:

  • Technically hard to make work with the new CDN

    • The CDN is being donated to the PSF, and the donated tier does not offer any form of log access
    • The work around for not having log access would greatly reduce the utility of the CDN
  • Highly inaccurate
    • A number of things prevent the download counts from being inaccurate, some of which include:

      • pip download cache
      • Internal or unofficial mirrors
      • Packages not hosted on PyPI (for comparisons sake)
      • Mirrors or unofficial grab scripts causing inflated counts (Last I looked 25% of the downloads were from a known mirroring script).
  • Not particularly useful

    • Just because a project has been downloaded a lot doesn’t mean it’s good
    • Similarly just because a project hasn’t been downloaded a lot doesn’t mean it’s bad

In short because it’s value is low for various reasons, and the tradeoffs required to make it work are high It has been not an effective use of resources.

As an alternative, the Linehaul project streams download logs to Google BigQuery [1]. Linehaul writes an entry in a the-psf.pypi.downloadsYYYYMMDD table for each download. The table contains information about what file was downloaded and how it was downloaded. Some useful columns from the table schema include:

Column Description Examples
file.project Project name pipenv, nose
file.version Package version 0.1.6, 1.4.2
details.installer.name Installer pip, bandersnatch
details.python Python version 2.7.12, 3.6.4
[1]PyPI BigQuery dataset announcement email

Setting up

In order to use Google BigQuery to query the PyPI package dataset, you’ll need a Google account and to enable the BigQuery API on a Google Cloud Platform project. You can run the up to 1TB of queries per month using the BigQuery free tier without a credit card

For more detailed instructions on how to get started with BigQuery, check out the BigQuery quickstart guide.

Useful queries

Run queries in the BigQuery web UI by clicking the “Compose query” button.

Note that the rows are stored in separate tables for each day, which helps limit the cost of queries. These example queries analyze downloads from recent history by using wildcard tables to select all tables and then filter by _TABLE_SUFFIX.

Counting package downloads

The following query counts the total number of downloads for the project “pytest”.

#standardSQL
SELECT COUNT(*) AS num_downloads
FROM `the-psf.pypi.downloads*`
WHERE file.project = 'pytest'
  -- Only query the last 30 days of history
  AND _TABLE_SUFFIX
    BETWEEN FORMAT_DATE(
      '%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
    AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
num_downloads
2117807

To only count downloads from pip, filter on the details.installer.name column.

#standardSQL
SELECT COUNT(*) AS num_downloads
FROM `the-psf.pypi.downloads*`
WHERE file.project = 'pytest'
  AND details.installer.name = 'pip'
  -- Only query the last 30 days of history
  AND _TABLE_SUFFIX
    BETWEEN FORMAT_DATE(
      '%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
    AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
num_downloads
1829322

Package downloads over time

To group by monthly downloads, use the _TABLE_SUFFIX pseudo-column. Also use the pseudo-column to limit the tables queried and the corresponding costs.

#standardSQL
SELECT
  COUNT(*) AS num_downloads,
  SUBSTR(_TABLE_SUFFIX, 1, 6) AS `month`
FROM `the-psf.pypi.downloads*`
WHERE
  file.project = 'pytest'
  -- Only query the last 6 months of history
  AND _TABLE_SUFFIX
    BETWEEN FORMAT_DATE(
      '%Y%m01', DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH))
    AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
GROUP BY `month`
ORDER BY `month` DESC
num_downloads month
1956741 201801
2344692 201712
1730398 201711
2047310 201710
1744443 201709
1916952 201708

Additional tools

You can also access the PyPI package dataset programmatically via the BigQuery API.

pypinfo

pypinfo is a command-line tool which provides access to the dataset and can generate several useful queries. For example, you can query the total number of download for a package with the command pypinfo package_name.

$ pypinfo requests
Served from cache: False
Data processed: 6.87 GiB
Data billed: 6.87 GiB
Estimated cost: $0.04

| download_count |
| -------------- |
|      9,316,415 |

Install pypinfo using pip.

pip install pypinfo

Other libraries