Gold patches fail 13 tasks in SWE-Bench-Verified #20

@KawaiiNotHawaii

I retrieved the gold patches from the swe-bench-verified dataset and uploaded them with sb-cli for testing. Only 487 passed all the test cases; of the remaining 13, 5 were marked as 'incompleted' and 8 as 'unresolved'.

    "unresolved_ids": [
        "astropy__astropy-7606",
        "astropy__astropy-8707",
        "astropy__astropy-8872",
        "django__django-10097",
        "psf__requests-1724",
        "psf__requests-2317",
        "pylint-dev__pylint-6528",
        "pylint-dev__pylint-7277"
    ],

I then ran mini-swe-agent with Claude and uploaded the resulting preds.json to sb-cli. Two of the unresolved_ids above came back marked as resolved, which suggests that the gold patch is not really 'gold' for these instances:

"django__django-10097",
"psf__requests-1724"

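For anyone who wants to repeat the comparison, here is a minimal sketch of how the two reports could be cross-checked. It assumes both reports are JSON objects with an `unresolved_ids` list (as in the excerpt above) and a `resolved_ids` list; the `resolved_ids` key and the helper names are assumptions for illustration, not something confirmed from the sb-cli output format.

```python
import json


def load_report(path: str) -> dict:
    """Read a report JSON file from disk (the path is supplied by the caller)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def gold_failures_resolved_by_agent(gold_report: dict, agent_report: dict) -> list[str]:
    """Return instance IDs the gold patch left unresolved but the agent run resolved.

    Assumes `unresolved_ids` in the gold report and `resolved_ids` in the
    agent report; a missing key is treated as an empty list.
    """
    gold_unresolved = set(gold_report.get("unresolved_ids", []))
    agent_resolved = set(agent_report.get("resolved_ids", []))  # assumed key name
    return sorted(gold_unresolved & agent_resolved)
```

Calling `gold_failures_resolved_by_agent(load_report(gold_path), load_report(agent_path))` with the two report files would list the instances where the agent's patch passes but the gold patch does not.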