Fix duplicated subtitle issue-- initialize cache directory + local cached subtitles can be manually downloaded as *.srt #13143

Open
TransZAllen wants to merge 13 commits into TeamNewPipe:dev from TransZAllen:duplicated_subtitle_5_on_newest_dev

TransZAllen commented Jan 30, 2026

What is it?

  • [ √ ] Bugfix (user facing)
  • Feature (user facing) ⚠️ Your PR must target the refactor branch
  • [ √ ] Codebase improvement (dev facing)
  • Meta improvement to the project (dev facing)

Description of the changes in your PR

This PR is a supporting change.

  1. Initializes cache/subtitle_cache directory;
  2. Ensures locally cached subtitle files can still be manually downloaded as '*.srt'. Before this change, only remote subtitle URLs could be downloaded manually.

Relies on the following changes

Here are the main changes:

Before/After Screenshots/Screen Record

This PR was tested only by manually downloading a '*.srt' file, using the YouTube video https://www.youtube.com/watch?v=b7vmW_5HSpE
The SRT subtitle file is downloaded as 'Violent Vincent_ Land Mines _ OFFICIAL MUSIC VIDEO-en-GB.srt'.

  • Before:

      double_subtitle/not_fixed$ cat 'Violent Vincent_ Land Mines _ OFFICIAL MUSIC VIDEO-en-GB.srt' | more
      1
      00:00:01,352 --> 00:00:01,852
       Land mines 
       I've left them everywhere 
      
      2
      00:00:01,352 --> 00:00:01,852
       Land mines 
       I've left them everywhere 
      
      3
      00:00:01,852 --> 00:00:02,619
       Land mines 
       I've left them everywhere 
      
      4
      00:00:01,852 --> 00:00:02,619
       Land mines 
       I've left them everywhere  
    
  • After:

    double_subtitle/fixed$ cat 'Violent Vincent_ Land Mines _ OFFICIAL MUSIC VIDEO-en-GB.srt' | more
    1
    00:00:01,352 --> 00:00:01,852
     Land mines 
     I've left them everywhere 
    
    2
    00:00:01,852 --> 00:00:02,619
     Land mines 
     I've left them everywhere
    

Fixes the following issue(s)

Due diligence

  • [ √ ] I read the contribution guidelines.
  • [ √ ] The proposed changes follow the AI policy.
  • [ √ ] I tested the changes using an emulator or a physical device.

@github-actions github-actions bot added the size/medium PRs with less than 250 changed lines label Jan 30, 2026
@TransZAllen TransZAllen changed the title from "Open Fix duplicated subtitle issue-- initialize cache directory + local cached subtitles can be manually downloaded as *.srt" to "Fix duplicated subtitle issue-- initialize cache directory + local cached subtitles can be manually downloaded as *.srt" Jan 31, 2026
@ShareASmile ShareASmile added bug Issue is related to a bug subtitles Related to displaying, converting or saving subtitles or captions. labels Feb 25, 2026
TransZAllen commented

Hi,

Just a quick note that I'm still working on the subtitle deduplication part. Found a new case today that is not handled yet, so I'm adjusting the logic.

I'll update the code soon.

…or to app side. [Bug] Fix duplicated subtitle issue.

- Add core deduplication logic/method
- Reproduce bug with the YouTube video: https://www.youtube.com/watch?v=b7vmW_5HSpE
  (Observed around 2026-03-03: the subtitle language that previously had the duplication issue no longer appears in the captions list)
- Introduce `SubtitleDeduplicator.java` to check and remove duplicates, storing results in cache.
- Add `SubtitleOrigin` and `SubtitleState` enums to model subtitle type and state.
- Ensure cache directory is recreated if missing.
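
A minimal sketch of the deduplication idea described above, not the PR's actual `SubtitleDeduplicator`. It assumes each TTML cue is a `<p begin="…" end="…">text</p>` element and that two cues are duplicates when their begin time, end time, and text all match. The class name `TtmlDedupSketch`, the regular expression, and the key format are illustrative assumptions.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TtmlDedupSketch {
    // Assumed cue shape: <p begin="..." end="..." ...>text</p>
    private static final Pattern CUE = Pattern.compile(
            "<p\\s+begin=\"([^\"]*)\"\\s+end=\"([^\"]*)\"[^>]*>(.*?)</p>",
            Pattern.DOTALL);

    private TtmlDedupSketch() {
    }

    /** Returns the TTML content with repeated cues removed; the first occurrence is kept. */
    static String deduplicate(final String ttml) {
        final Set<String> processedKeys = new HashSet<>();
        final StringBuilder out = new StringBuilder();
        final Matcher matcher = CUE.matcher(ttml);
        int last = 0;
        while (matcher.find()) {
            // Copy whatever sits between the previous cue and this one unchanged.
            out.append(ttml, last, matcher.start());
            // Hypothetical key: begin time, end time, and trimmed text.
            final String key = matcher.group(1) + '|' + matcher.group(2)
                    + '|' + matcher.group(3).trim();
            if (processedKeys.add(key)) {
                out.append(matcher.group());
            }
            last = matcher.end();
        }
        out.append(ttml, last, ttml.length());
        return out.toString();
    }
}
```

Applied to the "Before" sample in this PR, entries 1 and 2 share the same timestamps and text, so only one of them would survive; the same holds for entries 3 and 4.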
…eExtractor to NewPipe repository.

- Changed `package` and `import` statements to adapt to NewPipe main repository.
- Replace `javax.annotation.Nonnull` with `androidx.annotation.NonNull` for compatibility with `androidx`.
…Tube-related URLs.

- `SubtitleDeduplicator` relies on YouTube-specific subtitle URL semantics
  (videoId, languageCode, translationCode) for cache file naming and
  deduplication.

- Add `isYoutubeRelatedUrl()` to ensure deduplication logic is only
  applied to YouTube URLs. For non-YouTube subtitle URLs, the original
  subtitle URL is returned unchanged.
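
A rough sketch of what such a guard could look like. Only the method name `isYoutubeRelatedUrl()` comes from the commit message; the host rule (exactly `youtube.com` or any subdomain of it) and the class name are assumptions, and the PR may accept a different set of hosts.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

final class YoutubeUrlCheckSketch {
    private YoutubeUrlCheckSketch() {
    }

    /** Returns true only when the URL's host is youtube.com or one of its subdomains. */
    static boolean isYoutubeRelatedUrl(final String url) {
        if (url == null) {
            return false;
        }
        try {
            final String host = new URI(url).getHost();
            if (host == null) {
                return false;
            }
            final String lowerHost = host.toLowerCase(Locale.ROOT);
            // Assumed host rule; the leading dot avoids matching hosts like "notyoutube.com".
            return lowerHost.equals("youtube.com") || lowerHost.endsWith(".youtube.com");
        } catch (final URISyntaxException e) {
            // Unparseable input is treated as non-YouTube, so the caller keeps the original URL.
            return false;
        }
    }
}
```

A caller would run deduplication only when this returns `true` and otherwise pass the subtitle URL through untouched.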
…ry selection.

- This commit introduces `CacheDirUtils` to centralize application
  cache directory selection logic.

- The preferred cache directory path is now initialized in `App.onCreate()`
  and passed to `SubtitleDeduplicator`, instead of relying on `StateSaver.init()`.
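
The following is a plain-Java sketch of centralized cache directory selection, written without Android classes so it stays self-contained. The method names, the "external first, then internal" preference order, and the class name `CacheDirSketch` are assumptions; only the `subtitle_cache` directory name and the "recreate if missing" behaviour come from this PR's description.

```java
import java.io.File;

final class CacheDirSketch {
    static final String SUBTITLE_CACHE_DIR_NAME = "subtitle_cache";

    private CacheDirSketch() {
    }

    /**
     * Picks the cache root. The external directory may be null or unusable,
     * in which case the internal one is used (assumed preference order).
     */
    static File preferredCacheDir(final File externalCacheDir, final File internalCacheDir) {
        if (externalCacheDir != null
                && (externalCacheDir.isDirectory() || externalCacheDir.mkdirs())) {
            return externalCacheDir;
        }
        return internalCacheDir;
    }

    /** Returns the subtitle cache directory, creating it again if it was deleted. */
    static File ensureSubtitleCacheDir(final File cacheRoot) {
        final File dir = new File(cacheRoot, SUBTITLE_CACHE_DIR_NAME);
        if (!dir.isDirectory() && !dir.mkdirs()) {
            throw new IllegalStateException("Cannot create " + dir.getAbsolutePath());
        }
        return dir;
    }
}
```

In an app, the two arguments of `preferredCacheDir` would typically come from `Context.getExternalCacheDir()` and `Context.getCacheDir()`, resolved once at startup and handed to whatever component writes the cached subtitles.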
… NewPipeExtractor to app side. Add unit tests for `SubtitleDeduplicator` in `SubtitleDeduplicatorTest.java`.
…on-cache` errors after moving from NewPipeExtractor to NewPipe repository.

- "error: static import only from classes and interfaces"
- Changed `package` and `import` statements to adapt to NewPipe main repository.
- Name 'containsDuplicatedEntries_exactDuplicate_shouldReturnTrue' must match pattern '^[a-z][a-zA-Z0-9]*$'.
- Variable 'expected' should be declared final.
- '+' should be on a new line.
- Line is longer than 100 characters
…ubtitles in app layer.

Introduce a new domain class `AppStreamInfo` to perform subtitle
normalization (deduplication) on the application side without modifying
the extractor data `StreamInfo`.

Previously, subtitle deduplication logic modified `StreamInfo` directly,
which mixed application concerns with extractor data structures. This
change separates responsibilities by projecting extractor `StreamInfo`
into an app-level domain object.

Key points:
- Preserve original `StreamInfo` from the extractor unchanged
- Perform subtitle deduplication once when constructing `AppStreamInfo`
- Provide normalized subtitle list for player and download usage
- Ensure subtitle normalization logic is centralized and reusable
- `from()` only performs object creation
- Subtitle deduplication (which requires a network download) happens in
  `loadNormalizedSubtitles()`
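
The split between cheap construction and expensive normalization can be sketched as follows. This is a simplified stand-in, not the PR's `AppStreamInfo`: subtitle streams are modelled as plain URL strings, the normalizer is injected as a `UnaryOperator<String>`, and caching the result inside the object is an assumption made to illustrate "perform deduplication once".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.UnaryOperator;

final class AppStreamInfoSketch {
    // Stand-in for the subtitle list of the extractor's StreamInfo; never modified.
    private final List<String> originalSubtitleUrls;
    // Filled on the first call to loadNormalizedSubtitles().
    private List<String> normalizedSubtitleUrls;

    private AppStreamInfoSketch(final List<String> originalSubtitleUrls) {
        this.originalSubtitleUrls =
                Collections.unmodifiableList(new ArrayList<>(originalSubtitleUrls));
    }

    /** Object creation only: no network access and no deduplication happens here. */
    static AppStreamInfoSketch from(final List<String> extractorSubtitleUrls) {
        return new AppStreamInfoSketch(extractorSubtitleUrls);
    }

    List<String> getOriginalSubtitleUrls() {
        return originalSubtitleUrls;
    }

    /** Runs the (potentially slow) normalizer once per subtitle and caches the result. */
    synchronized List<String> loadNormalizedSubtitles(final UnaryOperator<String> normalizer) {
        if (normalizedSubtitleUrls == null) {
            final List<String> result = new ArrayList<>();
            for (final String url : originalSubtitleUrls) {
                result.add(normalizer.apply(url));
            }
            normalizedSubtitleUrls = Collections.unmodifiableList(result);
        }
        return normalizedSubtitleUrls;
    }
}
```

Because `loadNormalizedSubtitles()` may block on network I/O, a caller would invoke it off the main thread, while `from()` stays safe to call anywhere.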
…reen display (with ExoPlayer module).

- Replace StreamInfo.getSubtitles() with AppStreamInfo.loadNormalizedSubtitles()
  to download TTML subtitles and deduplicate them.
- Note: each module calling subtitle normalization will perform network download
  independently. AppStreamInfo cannot be shared across modules like StreamInfo.
…ownloads

- After remote subtitles (TTML format) are downloaded, the subtitle content
  is processed by SubtitleDeduplicator to remove duplicated segments.
  The cleaned content is then passed to SrtFromTtmlWriter to generate
  the final SRT subtitle file without duplicated entries.

- This logic is platform-independent and does not distinguish whether the
  video source is YouTube or another platform.

- A minimal ByteArraySharpStream implementation is used to adapt the
  deduplicated byte content back into the SharpStream interface without
  modifying existing stream APIs.

- Add comment explaining why `> 0` is used when reading SharpStream.
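
To illustrate the adapter and the read loop, here is a sketch that uses a hypothetical `ReadableStream` interface as a stand-in for the read side of `SharpStream`, so the example compiles on its own. The PR's own comment on `> 0` is not reproduced here; one plausible reason for that condition is that it ends the loop both on `-1` (end of stream) and on a `0` return, which avoids spinning forever if an implementation reports 0 bytes at the end.

```java
import java.io.ByteArrayOutputStream;

final class ByteArrayStreamSketch {
    private ByteArrayStreamSketch() {
    }

    /** Hypothetical stand-in for the read side of SharpStream. */
    interface ReadableStream {
        int read(byte[] buffer, int offset, int length);
    }

    /** Serves already-deduplicated bytes through the stream-style interface. */
    static final class ByteArrayReadableStream implements ReadableStream {
        private final byte[] data;
        private int position;

        ByteArrayReadableStream(final byte[] data) {
            this.data = data.clone();
        }

        @Override
        public int read(final byte[] buffer, final int offset, final int length) {
            final int remaining = data.length - position;
            if (remaining <= 0) {
                return -1; // end of content
            }
            final int count = Math.min(length, remaining);
            System.arraycopy(data, position, buffer, offset, count);
            position += count;
            return count;
        }
    }

    /** Drains the stream; stops on -1 and also on 0 so the loop always terminates. */
    static byte[] readFully(final ReadableStream in) {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        final byte[] buffer = new byte[8];
        int read;
        while ((read = in.read(buffer, 0, buffer.length)) > 0) {
            out.write(buffer, 0, read);
        }
        return out.toByteArray();
    }
}
```

In the flow described above, the downloaded TTML would be drained with a loop like `readFully`, deduplicated as a byte array or string, and then wrapped again before being handed to `SrtFromTtmlWriter`.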
…tyle tags and normalize subtitle text content

- Helps handle YouTube subtitles that have different style attributes but the same text and timestamps
- Add SUPPORT_STYLED_SUBTITLE_RENDERING flag for future styled subtitle support
  (currently not supported in NewPipe)
- Remove invisible Unicode characters (zero-width and directionality controls)
- Handle non-breaking spaces, BOM (U+FEFF), multiple spaces, and leading/trailing spaces
- This commit is tested with: https://www.youtube.com/watch?v=7w3jBGX7UcY
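
The normalization steps listed above could be sketched roughly like this. The exact character ranges, the tag-stripping regular expression, and the class and method names are assumptions for illustration; the PR's implementation may cover a different set of characters.

```java
import java.util.regex.Pattern;

final class SubtitleTextNormalizerSketch {
    // Assumed set: zero-width characters, directionality controls, and the BOM (U+FEFF).
    private static final Pattern INVISIBLE = Pattern.compile(
            "[\\u200B-\\u200F\\u202A-\\u202E\\u2060\\u2066-\\u2069\\uFEFF]");
    // Assumed: any inline tag (for example a styled span) is dropped; its text is kept.
    private static final Pattern TAGS = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    private SubtitleTextNormalizerSketch() {
    }

    /** Produces a comparison-friendly form of a cue's text. */
    static String normalize(final String rawCueText) {
        String text = TAGS.matcher(rawCueText).replaceAll("");
        text = INVISIBLE.matcher(text).replaceAll("");
        // Non-breaking space (U+00A0) is not matched by \s, so map it to a normal space first.
        text = text.replace((char) 0x00A0, ' ');
        text = WHITESPACE_RUN.matcher(text).replaceAll(" ");
        return text.trim();
    }
}
```

With this in place, two cues that differ only in their style attributes or in invisible characters normalize to the same string, so they yield the same deduplication key.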
…without changing logic

- Rename methods:
  - getSubtitleKeyOfTtml() → buildDeduplicationKey()
  - storeItToCacheDir() → writeContentToCacheFile()

- Rename variables:
  - subtitleContent → ttmlFileContent
  - seen → processedKeys
  - subCacheDir → SUBTITLE_DEDUP_CACHE_DIR

- Improve deduplicateContent() by clarifying variable names and adding comments

- Add comments to explain the logic

(No functional changes)
…ttern once for efficiency

- Replace defineTtmlSubtitlePattern() method with a static final Pattern
- getTtmlMatcher() now reuses the precompiled pattern instead of recompiling each call
- Improves performance when processing multiple subtitles without changing behavior
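
A small sketch of the pattern-reuse change. The method name `getTtmlMatcher()` comes from the commit message; the constant name, the regular expression, and the `countCues` helper are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TtmlPatternSketch {
    // Compiled once when the class is loaded; Pattern instances are immutable and thread-safe.
    private static final Pattern TTML_SUBTITLE_PATTERN = Pattern.compile(
            "<p\\s+begin=\"([^\"]*)\"\\s+end=\"([^\"]*)\"[^>]*>(.*?)</p>",
            Pattern.DOTALL);

    private TtmlPatternSketch() {
    }

    /** Creates a new Matcher per input; only the Pattern is shared. */
    static Matcher getTtmlMatcher(final String ttmlFileContent) {
        return TTML_SUBTITLE_PATTERN.matcher(ttmlFileContent);
    }

    /** Hypothetical helper that counts cue elements in the given content. */
    static int countCues(final String ttmlFileContent) {
        final Matcher matcher = getTtmlMatcher(ttmlFileContent);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }
}
```

`Matcher` objects hold per-input state and are not thread-safe, which is why only the `Pattern` is cached and a fresh `Matcher` is created on each call.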
@TransZAllen TransZAllen force-pushed the duplicated_subtitle_5_on_newest_dev branch from 4b86db0 to a87a8da Compare April 1, 2026 13:56
@github-actions github-actions bot added size/giant PRs with more than 750 changed lines and removed size/medium PRs with less than 250 changed lines labels Apr 1, 2026

TransZAllen commented Apr 1, 2026

Hi,

Just a quick update.

I have pushed a new set of changes (13 commits) for this PR.
To keep everything under the same PR, I updated the branch, but unfortunately the previous commits were overwritten.

For context, most of the earlier discussion and review comments are in the extractor-side PR:
TeamNewPipe/NewPipeExtractor#1448

I will continue posting detailed updates there.

Labels

  • bug: Issue is related to a bug
  • size/giant: PRs with more than 750 changed lines
  • subtitles: Related to displaying, converting or saving subtitles or captions.

Development

Successfully merging this pull request may close these issues.

Subtitles appear twice sometimes

2 participants