portextract: add file_extract command #402

Open
herbygillot wants to merge 1 commit into macports:master from herbygillot:file-extract-trac-50969

Conversation

@herbygillot
Member

Add a new file_extract command that extracts archive files with automatic format detection from file suffixes. Supports gzip, bzip2, xz, lzma, lzip, zstd, compress, tar, zip, 7z, and dmg. Accepts -dirname to override the extraction directory and -type to override suffix-based detection. Relative filenames are resolved in filespath then distpath, and distfile tags are stripped before lookup.
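The lookup order described above (filespath first, then distpath, with distfile tags stripped beforehand) can be illustrated with a small Python sketch. This is not the Tcl code from the PR: the function names `strip_tag` and `resolve_archive` are hypothetical, and the tag syntax (a trailing `:tag` made of letters, digits, `_` or `-`) is an assumption about how distfile tags are written.

```python
import os
import re

# Assumption: a distfile tag is a trailing ":tag" made of letters, digits, "_" or "-".
_TAG_RE = re.compile(r"^(.+):[0-9A-Za-z_-]+$")


def strip_tag(name):
    """Return the file name without a trailing distfile tag, if one is present."""
    match = _TAG_RE.match(name)
    return match.group(1) if match else name


def resolve_archive(name, filespath, distpath):
    """Resolve an archive name: strip the tag, then try filespath, then distpath."""
    name = strip_tag(name)
    if os.path.isabs(name):
        # Absolute paths are used as given; only relative names are searched for.
        return name
    for base in (filespath, distpath):
        candidate = os.path.join(base, name)
        if os.path.exists(candidate):
            return candidate
    raise FileNotFoundError(f"{name} not found in {filespath} or {distpath}")
```

With this ordering, a file that exists in both directories resolves to the copy in filespath, and a tagged name such as `foo.tar.gz:main` is looked up as `foo.tar.gz`.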

This is intended to eventually replace the use_* extraction switches internally, while the use_* options remain available to Portfile authors.

Includes unit and integration tests, and portfile(7) man page documentation.

Fixes: https://trac.macports.org/ticket/50969

@ryandesign
Contributor

Thank you!

Is this intended to replace the use_* options for all distfiles or be used in addition to them for supplemental distfiles?

Who calls file_extract? The Portfile, or base on the basis of $distfiles?

How can a port override the extraction method for distfiles with unknown extensions (like .jar which is a .zip file or .crate which is a .tar.gz file) or whose extension is wrong for the type of data it contains (yes, we've encountered that)?

How can a port extract individual files from an archive, rather than the whole archive? curl-ca-bundle is a small example port that does this; php is a big one. Look for extract.post_args in those Portfiles.

@herbygillot
Member Author

herbygillot commented Mar 31, 2026

Thank you!

Is this intended to replace the use_* options for all distfiles or be used in addition to them for supplemental distfiles?

You are very welcome.

We can change this PR to go in any direction that folks prefer and agree upon.

As of right now, the use_* options are completely untouched and continue to behave as they always have. This PR introduces the file_extract command alongside everything else.

The current intention is to keep the use_* options in the future, but to replace their internal logic with file_extract.

Who calls file_extract? The Portfile, or base on the basis of $distfiles?

Right now, nothing calls file_extract. The user would explicitly add one or more file_extract lines to their Portfile. We can change this if desired.

How can a port override the extraction method for distfiles with unknown extensions (like .jar which is a .zip file or .crate which is a .tar.gz file) or whose extension is wrong for the type of data it contains (yes, we've encountered that)?

Use file_extract -type zip foo.jar. The types that file_extract supports are listed in the portfile man page.
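To make the interaction between suffix detection and the -type override concrete, here is a minimal Python sketch. It is an illustration only, not the PR's Tcl code. The type names come from the PR description, but the suffix table is an assumption (the PR text does not list the suffixes), as is the choice to look only at the last suffix, so `foo.tar.gz` is classified as `gzip` here.

```python
# Assumed suffix table; the type names come from the PR description,
# the suffixes themselves are a guess for illustration.
SUFFIX_TYPES = {
    ".gz": "gzip",
    ".bz2": "bzip2",
    ".xz": "xz",
    ".lzma": "lzma",
    ".lz": "lzip",
    ".zst": "zstd",
    ".Z": "compress",
    ".tar": "tar",
    ".zip": "zip",
    ".7z": "7z",
    ".dmg": "dmg",
}
KNOWN_TYPES = set(SUFFIX_TYPES.values())


def detect_type(filename, type_override=None):
    """Pick an extraction type: an explicit override wins, else the last suffix."""
    if type_override is not None:
        if type_override not in KNOWN_TYPES:
            raise ValueError(f"unsupported type: {type_override}")
        return type_override
    for suffix, kind in SUFFIX_TYPES.items():
        if filename.endswith(suffix):
            return kind
    raise ValueError(f"cannot detect archive type of {filename}")
```

In this sketch `detect_type("foo.jar")` fails because `.jar` is not in the table, while `detect_type("foo.jar", "zip")` succeeds, which mirrors the `file_extract -type zip foo.jar` answer.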

How can a port extract individual files from an archive, rather than the whole archive? curl-ca-bundle is a small example port that does this; php is a big one. Look for extract.post_args in those Portfiles.

The command doesn't support this right now, but if that's the desire, then what do you think the command should look like? As it is right now, each additional argument to file_extract is treated as an additional archive to extract. Should it be:

a) file_extract <archive1> <archive2> <archive3> ...

-or

b) file_extract <archive> <target_file_to_extract_1> <target_file_to_extract_2> ...

?
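The two signatures cannot both be inferred from positional arguments alone, which a short Python sketch makes visible. Both function names and the `(archive, members)` plan format are hypothetical, introduced only for this illustration; `None` for members stands for "extract the whole archive".

```python
def plan_archives(args):
    """Interpretation (a): every argument is an archive to extract in full."""
    return [(archive, None) for archive in args]


def plan_members(args):
    """Interpretation (b): the first argument is the archive, the rest are members."""
    if not args:
        raise ValueError("an archive is required")
    archive, *members = args
    # No members listed means the whole archive is extracted.
    return [(archive, members or None)]
```

Given the same argument list, for example `["a.zip", "b.zip"]`, interpretation (a) plans two full extractions while interpretation (b) plans to extract a member named `b.zip` from `a.zip`, so the command would need some explicit way to tell the two forms apart.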

@herbygillot herbygillot force-pushed the file-extract-trac-50969 branch 3 times, most recently from fa7cc83 to 964a12d Compare March 31, 2026 03:56
Add a new file_extract command that extracts archive files with
automatic format detection from file suffixes. Supports gzip, bzip2,
xz, lzma, lzip, zstd, compress, tar, zip, 7z, and dmg. Accepts
-dirname to override the extraction directory and -type to override
suffix-based detection. Relative filenames are resolved in filespath
then distpath, and distfile tags are stripped before lookup.

This is intended to eventually replace the use_* extraction switches
internally, while the use_* options remain available to Portfile
authors.

Includes unit and integration tests, and portfile(7) man page
documentation.

Fixes: https://trac.macports.org/ticket/50969

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@herbygillot herbygillot force-pushed the file-extract-trac-50969 branch from 964a12d to e2b0ffc Compare March 31, 2026 04:04
@herbygillot
Member Author

One notable change I've made for this command: no matter where it is specified, it will only run during the extract phase.

The option exists to change it into an on-demand "instant" command that extracts the moment it's parsed, but it seems to me like that could cause issues.
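The deferred behaviour described here (calls are recorded when the Portfile is parsed and only carried out in the extract phase) can be sketched in Python as a simple queue. The class and method names are hypothetical and the `extractor` callback stands in for the real extraction logic; the PR itself is implemented in Tcl.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ExtractRequest:
    """One recorded file_extract call (hypothetical structure)."""
    archives: List[str]
    dirname: Optional[str] = None
    type: Optional[str] = None


class ExtractQueue:
    def __init__(self):
        self._pending: List[ExtractRequest] = []

    def file_extract(self, *archives, dirname=None, type=None):
        # Parse time: only record the request, do not touch any files yet.
        self._pending.append(ExtractRequest(list(archives), dirname, type))

    def run_extract_phase(self, extractor: Callable[[ExtractRequest], None]):
        # Extract phase: run the recorded requests in the order they were made.
        pending, self._pending = self._pending, []
        for request in pending:
            extractor(request)
```

An "instant" variant would call the extractor directly inside `file_extract` instead of queueing, which is the behaviour the comment above suggests could cause issues.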

@jmroot
Member

jmroot commented Apr 1, 2026

On master I've made some changes to allow automatic extraction of multiple distfiles out of the box when they use different archive formats. As part of that I refactored a bit to make the filename suffix the primary source of information about how each file should be extracted, but also added a way to override this choice to allow correctly handling archives that are named incorrectly or just unusually. This is basically Rainer's idea from the ML post linked in #50969.

I'm not quite sure where this leaves the original idea of the ticket. The main motivation for the generic extract command seems to have been handling mixed file types, but maybe there are other situations where it's useful?

@RJVB
Contributor

RJVB commented Apr 2, 2026

The command doesn't support this right now, but if that's the desire, then what do you think the command should look like? As it is right now, each additional argument to file_extract is treated as an additional archive to extract. Should it be:

Is it necessary to do everything via "command line" options, or would it be better to use dedicated file_extract.XYZ "modifiers" like we are used to dealing with (and that may have a more intuitive mnemonic value)?
Here's an example of how I implemented 2 options to control what is to be extracted from a single huge source tarball:
https://github.com/RJVB/macstrop/blob/ca45ee7bfaba1b10c5fbf318060d9b5ebd33fea9/aqua/qt5-kde-devel/Portfile#L245
(this could translate to a file_extract.only/select/include and a file_extract.skip/exclude option).
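A rough Python sketch of how such an include/exclude pair might filter the member list of an archive is shown below. The function name, the `only` and `skip` parameter names, and the use of shell-style glob patterns are all assumptions made for illustration; neither the PR nor the linked Portfile is being quoted here.

```python
from fnmatch import fnmatchcase


def select_members(members, only=(), skip=()):
    """Keep members matching any 'only' pattern (if given) and no 'skip' pattern."""
    selected = []
    for member in members:
        if only and not any(fnmatchcase(member, pat) for pat in only):
            continue  # an include list exists and this member is not on it
        if any(fnmatchcase(member, pat) for pat in skip):
            continue  # explicitly excluded
        selected.append(member)
    return selected
```

In this sketch an empty `only` list means "everything", and `skip` is applied after `only`, so an exclusion always wins over an inclusion.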

The option exists to change it into an on-demand "instant" command that extracts the moment it's parsed, but it seems to me like that could cause issues.

This is one where a --now "command line" option would make sense, supposing the command can be used as a shorthand for writing a whole bunch of code to extract additional files, possibly with appropriate user feedback and everything. Most typically that would be used in a post-extract step, but if the command is concise enough it could be used anyplace where e.g. one has to restore one or more files to "factory defaults" (i.e. it could be cleaner than making backups of those files and restoring them).
Either way the command should probably raise an error when invoked from the Portfile toplevel.

One thought that crosses my mind: some form of sandboxing, overridable if a directly callable version (--now option as above) were to be implemented.
