Supporting downstream packaging

Page status: Draft

Last reviewed: 2025-?

Although PyPI and Python packaging tools such as pip are the primary means of distributing Python packages, these packages are also frequently made available as part of other packaging ecosystems. These repackaging efforts are collectively called downstream packaging (your own efforts being upstream packaging), and include projects such as Linux distributions, Conda, Homebrew and MacPorts. They generally aim to provide improved support for use cases that cannot be handled by Python packaging tools alone, such as native integration with a specific operating system, or guaranteed compatibility with specific versions of non-Python software.

This discussion attempts to explain how downstream packaging is usually done and what additional challenges downstream packagers typically face. Its goal is to provide some optional guidelines that project maintainers may choose to follow, which help make downstream packaging significantly easier (without imposing a major maintenance burden on the upstream project). Note that this is not an all-or-nothing proposal: anything that upstream maintainers can do helps, even if it is only a small part. Downstream maintainers are also willing to prepare patches to resolve these issues. Merging such patches can be very helpful, as it removes the need for different downstreams to carry and keep rebasing the same patches, as well as the risk of applying inconsistent solutions to the same problem.

Establishing a good relationship between software maintainers and downstream packagers can bring mutual benefits. Downstreams are often willing to share their experience, time and hardware to improve your package. They are sometimes in a better position to see how your package is used in practice, and to provide insight into its relationships with other packages that would otherwise require significant effort to obtain. Packagers are often able to find bugs before your users hit them in production, provide good-quality bug reports, and supply patches whenever possible. For example, they are regularly active in ensuring that the packages they redistribute are updated for any compatibility issues that arise when a new Python version is released.

Note that downstream builds include not only binary redistribution, but also source builds performed on user systems (on source-first distributions such as Gentoo Linux, for example).

Provide complete source distributions

Why?

The vast majority of downstream packagers prefer building packages from source over using upstream-provided binary packages. In some cases, using the source code is actually required for a package to be included in the distribution. This applies even to pure Python packages that provide universal wheels. Reasons for using source distributions may include:

  • Being able to audit the source code of all packages.

  • Being able to run the test suite and build the documentation.

  • Being able to easily apply patches, including backporting commits from the project repository and submitting patches back to the project.

  • Being able to build on a specific platform that is not covered by upstream builds.

  • Being able to build against specific versions of system libraries.

  • Having a consistent build process across all Python packages.

While it is possible to build packages from a Git repository, there are important reasons why providing a static source archive is encouraged:

  • Fetching a single file is usually more efficient, reliable and better supported than, for example, using a Git clone. This can help users with poor Internet connectivity.

  • Downstreams often use hashes to verify the authenticity of source archives in subsequent builds, which requires the archives to remain bit-for-bit identical over time. Automatically generated Git archives do not guarantee this; for example, the compressed data may change if gzip is upgraded on the server.

  • Archives can be mirrored, reducing bandwidth use both upstream and downstream. Builds can later be performed in firewalled or offline environments that can only access source archives provided by a local mirror or redistributed earlier.

  • Explicitly publishing archives can ensure that any dependencies on version control system metadata are resolved when the source archive is created. For example, automatically generated Git archives omit all commit tag information, which can result in incorrect version details in the resulting builds.

How?

Ideally, a source distribution published on PyPI should include all files from the package's Git repository that are needed to build the package itself, run its test suite, build and install its documentation, and any other files that may be useful to end users, such as shell completions, editor support files, and so on.

This point applies only to the files belonging to the package itself. The downstream packaging process, much like Python package managers, will provision the necessary Python dependencies, system tools and external libraries that are needed by your package and its build scripts. However, the files listing these dependencies (for example, requirements*.txt files) should also be included, to help downstreams determine the needed dependencies, and check for changes in them.
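As an illustration, for a project built with setuptools, a MANIFEST.in along these lines pulls such files into the source distribution. This is only a sketch; the paths are hypothetical and need to match the actual repository layout:

    # MANIFEST.in (illustrative paths)
    # Ship the test suite, documentation sources, and auxiliary files
    # alongside the code, and keep generated artifacts out.
    include tox.ini
    graft tests
    graft docs
    graft completions
    prune docs/_build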

Some projects have concerns related to Python package managers using source distributions from PyPI. They do not wish to increase their size with files that are not used by these tools, or they do not wish to publish source distributions at all, as they enable a problematic or outright nonfunctional fallback to building the particular project from source. In these cases, a good compromise may be to publish a separate source archive for downstream use elsewhere, for example by attaching it to a GitHub release. Alternatively, large files, such as test data, can be split into separate archives.

On the other hand, some projects (NumPy, for instance) decide to include tests in their installed packages. This has the added advantage of permitting users to run tests against the installed package, for example to check for regressions after upgrading a dependency. Yet another approach is to split tests or test data into a separate Python package. Such an approach was taken by the cryptography project, with the large test vectors being split into the cryptography-vectors package.

A good idea is to use your source distribution in the release workflow. For example, the build tool does exactly that: it first builds a source distribution, and then uses it to build a wheel. This ensures that the source distribution actually works, and that it won't accidentally install fewer files than the official wheels.

Ideally, also use the source distribution to run tests, build documentation, and so on, or add specific tests to make sure that all necessary files were actually included. Understandably, this requires more effort, so it's fine not to do that; downstream packagers will report any missing files promptly.
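As a rough sketch, assuming a hypothetical project named example that defines a test extra, a release workflow centred on the source distribution could look like this:

    # Build the sdist, then a wheel from that sdist (the default
    # behaviour of the build tool):
    python -m build
    # Smoke-test the sdist itself: unpack, install, run the test suite.
    tar -xf dist/example-1.0.tar.gz -C /tmp
    cd /tmp/example-1.0
    python -m pip install '.[test]'
    python -m pytest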

Do not use the Internet during the build process

Why?

Downstream builds are frequently done in sandboxed environments that cannot access the Internet. The package sources are unpacked into this environment, and all the necessary dependencies are installed.

Even when this is not the case, and even assuming that you take sufficient care to authenticate your downloads, using the Internet is discouraged for the following reasons:

  • The Internet connection may be unstable (e.g. due to poor reception) or suffer from temporary problems that could cause the process to fail or hang.

  • The remote resources may become temporarily or even permanently unavailable, making the build no longer possible. This is especially problematic when someone needs to build an old package version.

  • The remote resources may change, making the build not reproducible.

  • Accessing remote servers poses a privacy issue and a potential security issue, as it exposes information about the system building the package.

  • The user may be using a service with a limited data plan, in which uncontrolled Internet access may result in additional charges or other inconveniences.

How?

If the package is implementing any custom build backend actions that use the Internet, for example by automatically downloading vendored dependencies or fetching Git submodules, its source distribution should either include all of these files or allow provisioning them externally, and the Internet must not be used if the files are already present.

Note that this point does not apply to Python dependencies that are specified in the package metadata, and are fetched during the build and installation process by frontends (such as build or pip). Downstreams use frontends that provision Python dependencies locally.

Ideally, custom build scripts should not even attempt to access the Internet at all, unless explicitly requested to. If any resources are missing and need to be fetched, they should ask the user for permission first. If that is not feasible, the next best thing is to provide an opt-out switch to disable all Internet access. This could be done e.g. by checking whether a NO_NETWORK environment variable is set to a non-empty value.
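For instance, a custom build step that provisions vendored sources could implement the opt-out along these lines. This is a minimal sketch: the function and file layout are hypothetical, and only the NO_NETWORK variable name comes from the suggestion above.

    import os
    import urllib.request

    def provision_vendored_sources(url: str, dest: str) -> None:
        if os.path.exists(dest):
            # Already provided, e.g. shipped in the sdist or provisioned
            # externally by the packager: never touch the network.
            return
        if os.environ.get("NO_NETWORK"):
            raise RuntimeError(
                f"{dest} is missing and NO_NETWORK is set; "
                "please provision the file manually before building"
            )
        urllib.request.urlretrieve(url, dest)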

Since downstreams frequently also run tests and build documentation, the above should ideally extend to these processes as well.

Please also remember that if you are fetching remote resources, you absolutely must verify their authenticity (usually against a hash), to protect against the file being substituted by a malicious party.
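In the hypothetical download helper above, that could mean checking the file against a hash pinned in the build script before using it; a minimal sketch:

    import hashlib

    def verify_sha256(path: str, expected_sha256: str) -> None:
        # Fail loudly on mismatch; never fall back to the unverified file.
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest != expected_sha256:
            raise RuntimeError(f"checksum mismatch for {path}: got {digest}")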

Support building against system dependencies

Why?

Some Python projects have non-Python dependencies, such as libraries written in C or C++. Trying to use the system versions of these dependencies in upstream packaging may cause a number of problems for end users:

  • The published wheels require a binary-compatible version of the used library to be present on the user’s system. If the library is missing or an incompatible version is installed, the Python package may fail with errors that are not clear to inexperienced users, or even misbehave at runtime.

  • Building from a source distribution requires a source-compatible version of the dependency to be present, along with its development headers and other auxiliary files that some systems package separately from the library itself.

  • Even for an experienced user, installing a compatible version of the dependency can be difficult. For example, the Linux distribution they use may not provide the needed version, or another installed package may require an incompatible version.

  • The linkage between the Python package and its system dependency is not recorded by the packaging system. The next system update may upgrade the library to a newer version that breaks binary compatibility with the Python package, and requires user intervention to fix.

For these reasons, you may reasonably decide to either statically link your dependencies, or to provide local copies in the installed package. You may also vendor the dependency in your source distribution. Sometimes these dependencies are also repackaged on PyPI, and can be declared as project dependencies like any other Python package.

However, none of these issues apply to downstream packaging, and downstreams have good reasons to prefer dynamically linking to system dependencies. In particular:

  • In many cases, reliably sharing dynamic dependencies between components is a large part of the purpose of a downstream packaging ecosystem. Helping to support that makes it easier for users of those systems to access upstream projects in their preferred format.

  • Static linking and vendoring obscures the use of external dependencies, making source auditing harder.

  • Dynamic linking makes it possible to quickly and systematically replace the used libraries across an entire downstream packaging ecosystem, which can be particularly important when they turn out to contain a security vulnerability or critical bug.

  • Using system dependencies makes the package benefit from downstream customization that can improve the user experience on a particular platform, without the downstream maintainers having to consistently patch the dependencies vendored in different packages. This can include compatibility improvements and security hardening.

  • Static linking and vendoring can result in multiple different versions of the same library being loaded in the same process (for example, attempting to import two Python packages that link to different versions of the same library). This sometimes works without incident, but it can also lead to anything from library loading errors, to subtle runtime bugs, to catastrophic failures (like suddenly crashing and losing data).

  • Last but not least, static linking and vendoring result in duplication, and can increase disk space and memory use.

How?

A good compromise between the needs of both parties is to provide a switch between using vendored and system dependencies. Ideally, if the package has multiple vendored dependencies, it should provide both individual switches for each dependency, and a general switch to control the default for them, e.g. via a USE_SYSTEM_DEPS environment variable.

If the user requests using system dependencies, and a particular dependency is either missing or incompatible, the build should fail with an explanatory message rather than fall back to a vendored version. This gives the packager the opportunity to notice their mistake and a chance to consciously decide how to solve it.
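A sketch of what this could look like in a custom build script, for a hypothetical vendored zlib; only the USE_SYSTEM_DEPS variable name is taken from the suggestion above, and the per-dependency scheme is illustrative:

    import os
    import shutil
    import subprocess

    def use_system(dep: str) -> bool:
        # A per-dependency switch (e.g. USE_SYSTEM_ZLIB=1) overrides the
        # general USE_SYSTEM_DEPS default.
        value = os.environ.get(
            f"USE_SYSTEM_{dep.upper()}",
            os.environ.get("USE_SYSTEM_DEPS", ""),
        )
        return value.lower() not in ("", "0", "no", "false")

    def zlib_build_flags() -> list[str]:
        if not use_system("zlib"):
            return ["-Ivendor/zlib"]  # use the vendored copy
        # The system library was explicitly requested: fail loudly if it
        # is missing or incompatible, rather than silently vendoring.
        if shutil.which("pkg-config") is None or subprocess.run(
            ["pkg-config", "--exists", "zlib >= 1.2.11"]
        ).returncode != 0:
            raise RuntimeError(
                "system zlib >= 1.2.11 requested but not found; "
                "unset USE_SYSTEM_ZLIB to build the vendored copy"
            )
        flags = subprocess.run(
            ["pkg-config", "--cflags", "--libs", "zlib"],
            capture_output=True, text=True, check=True,
        )
        return flags.stdout.split()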

It is reasonable for upstream projects to leave testing of building with system dependencies to their downstream repackagers. The goal of these guidelines is to facilitate more effective collaboration between upstream projects and downstream repackagers, not to suggest upstream projects take on tasks that downstream repackagers are better equipped to handle.

Support downstream testing

Why?

A variety of downstream projects run some degree of testing on the packaged Python projects. Depending on the particular case, this can range from minimal smoke testing to comprehensive runs of the complete test suite. There can be various reasons for doing this, for example:

  • Verifying that the downstream packaging did not introduce any bugs.

  • Testing on additional platforms that are not covered by upstream testing.

  • Finding subtle bugs that can only be reproduced with specific hardware, particular system package versions, and so on.

  • Testing the released package against newer (or older) dependency versions than the ones present during upstream release testing.

  • Testing the package in an environment closely resembling the production setup. This can detect issues caused by non-trivial interactions between different installed packages, including packages that are not dependencies of your package, but nevertheless can cause issues.

  • Testing the released package against newer Python versions (including newer point releases), or less tested Python implementations such as PyPy.

Admittedly, sometimes downstream testing may yield false positives or bug reports about scenarios the upstream project is not interested in supporting. However, perhaps even more often it does provide early notice of problems, or find non-trivial bugs that would otherwise cause issues for the upstream project’s users. While mistakes do happen, the majority of downstream packagers are doing their best to double-check their results, and help upstream maintainers triage and fix the bugs that they reported.

How?

There are a number of things that upstream projects can do to help downstream repackagers test their packages efficiently and effectively, including some of the suggestions already mentioned above. These are typically improvements that make the test suite more reliable and easier to use for everyone, not just downstream packagers. Some specific suggestions are:

  • Include the test files and fixtures in the source distribution, or make it possible to easily download them separately.

  • Do not write to the package directories during testing. Downstream test setups sometimes run tests on top of the installed package, and modifications performed during testing and temporary test files may end up being part of the installed package!

  • Make the test suite work offline. Mock network interactions, using packages such as responses or vcrpy. If that is not possible, make it possible to easily disable the tests using Internet access, e.g. via a pytest marker (see the conftest.py sketch after this list). Use pytest-socket to verify that your tests work offline. This often makes your own test workflows faster and more reliable as well.

  • Make your tests work without a specialized setup, or perform the necessary setup as part of test fixtures. Do not ever assume that you can connect to system services such as databases — in an extreme case, you could crash a production service!

  • If your package has optional dependencies, make their tests optional as well. Either skip them if the needed packages are not installed, or add markers to make deselecting easy.

  • More generally, add markers to tests with special requirements. These can include e.g. significant space usage, significant memory usage, long runtime, incompatibility with parallel testing.

  • Do not assume that the test suite will be run with -Werror. Downstreams often need to disable that, as it causes false positives, e.g. due to newer dependency versions. Assert for warnings using pytest.warns() rather than pytest.raises()!

  • Aim to make your test suite reliable and reproducible. Avoid flaky tests. Avoid depending on specific platform details, don’t rely on exact results of floating-point computation, or timing of operations, and so on. Fuzzing has its advantages, but you want to have static test cases for completeness as well.

  • Split tests by their purpose, and make it easy to skip categories that are irrelevant or problematic. Since the primary purpose of downstream testing is to ensure that the package itself works, downstreams are not generally interested in tasks such as checking code coverage, code formatting, typechecking or running benchmarks. These tests can fail as dependencies are upgraded or the system is under load, without actually affecting the package itself.

  • If your test suite takes significant time to run, support testing in parallel. Downstreams often maintain a large number of packages, and testing them all takes a lot of time. Using pytest-xdist can help them avoid bottlenecks.

  • Ideally, support running your test suite via pytest. pytest has many command-line arguments that are truly helpful to downstreams, such as the ability to conveniently deselect tests, rerun flaky tests (via pytest-rerunfailures), add a timeout to prevent tests from hanging (via pytest-timeout) or run tests in parallel (via pytest-xdist). Note that test suites don’t need to be written with pytest to be executed with pytest: pytest is able to find and execute almost all test cases that are compatible with the standard library’s unittest test discovery.
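Several of the points above fit into a few lines of conftest.py. The following is a sketch, assuming pytest; the network and slow marker names are illustrative choices:

    # conftest.py
    import pytest

    def pytest_configure(config):
        # Register custom markers so they are documented and do not
        # trigger unknown-marker warnings.
        config.addinivalue_line("markers", "network: needs Internet access")
        config.addinivalue_line("markers", "slow: takes a long time to run")

    def pytest_addoption(parser):
        parser.addoption("--run-network", action="store_true",
                         help="run tests that need Internet access")

    def pytest_collection_modifyitems(config, items):
        # Network tests are opt-in, since downstream builds are offline.
        if config.getoption("--run-network"):
            return
        skip_network = pytest.mark.skip(reason="needs --run-network")
        for item in items:
            if "network" in item.keywords:
                item.add_marker(skip_network)

Tests of an optional dependency can then skip cleanly when it is absent, for example via pytest.importorskip("numpy") at the top of the test module, and downstreams can deselect categories and parallelize from the command line, e.g. python -m pytest -m "not slow" -n auto --timeout=300 (with pytest-xdist and pytest-timeout installed).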

Aim for stable releases

Why?

Many downstreams provide stable release channels in addition to the main package streams. The goal of these channels is to provide more conservative upgrades to users with higher stability needs. These users often prefer to trade having the newest features available for lower risk of issues.

While the exact policies differ, an important criterion for including a new package version in a stable release channel is for it to be available in testing for some time already, and have no known major regressions. For example, in Gentoo Linux a package is usually marked stable after being available in testing for a month, and being tested against the versions of its dependencies that are marked stable at the time.

However, there are circumstances which demand more prompt action. For example, if a security vulnerability or a major bug is found in the version that is currently available in the stable channel, the downstream is facing a need to resolve it. In this case, they need to consider various options, such as:

  • putting a new version in the stable channel early,

  • adding patches to the version currently published,

  • or even rolling the stable channel back to an earlier version.

Each option involves certain risks and a certain amount of work, and packagers must weigh these options to determine their course of action.

How?

There are some things that upstreams can do to tailor their workflow to stable release channels. These actions often are beneficial to the package’s users as well. Some specific suggestions are:

  • Adjust the release frequency to the rate of code changes. Packages that are released rarely often bring significant changes with every release, and a higher risk of accidental regressions.

  • Avoid mixing bug fixes and new features, if possible. In particular, if there are known bug fixes merged already, consider making a new release before merging feature branches.

  • Consider making prereleases after major changes, to give users and other interested parties more opportunity for testing.

  • If your project is subject to very intense development, consider splitting one or more branches that include a more conservative subset of commits, and are released separately. For example, Django currently maintains three release branches in addition to main.

  • Even if you don’t wish to maintain additional branches permanently, consider making additional patch releases with minimal changes to the previous version, especially when a security vulnerability is discovered.

  • Split your changes into focused commits that address one problem at a time, to make it easier to cherry-pick changes to earlier releases when necessary.
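For example, with focused commits a fix can be backported with a single cherry-pick; a sketch with illustrative branch and commit names:

    # Backport one focused fix onto a maintained release branch.
    git switch 5.1.x
    git cherry-pick -x abc1234   # -x records the original commit id
    git push origin 5.1.x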