Fix duplicated subtitle issue: initialize cache directory + local cached subtitles can be manually downloaded as *.srt #13143
Open
TransZAllen wants to merge 13 commits into TeamNewPipe:dev
TransZAllen (Contributor, Author):
Hi, just a quick note that I'm still working on the subtitle deduplication part. I found a new case today that is not handled yet, so I'm adjusting the logic. I'll update the code soon.
…or to app side. [Bug] Fix duplicated subtitle issue.
- Add core deduplication logic/method.
- Reproduce the bug with the YouTube video: https://www.youtube.com/watch?v=b7vmW_5HSpE (observed around 2026-03-03: the subtitle language that previously had the duplication issue no longer appears in the captions list).
- Introduce `SubtitleDeduplicator.java` to check for and remove duplicates, storing the results in the cache.
- Add `SubtitleOrigin` and `SubtitleState` enums to model subtitle type and state.
- Ensure the cache directory is recreated if missing.
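As a rough illustration of the core idea, here is a minimal sketch of key-based deduplication. It assumes that two entries count as duplicates when their timestamps and text match. The `Cue` class and `deduplicate()` are hypothetical; `buildDeduplicationKey` and `processedKeys` borrow names from the rename commit further down, but their real signatures are not shown in this PR page.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DedupSketch {
    /** Hypothetical cue model: one subtitle entry with its timing and text. */
    static final class Cue {
        final String begin;
        final String end;
        final String text;

        Cue(final String begin, final String end, final String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    /** Assumed key shape: begin time, end time, and trimmed text. */
    static String buildDeduplicationKey(final Cue cue) {
        return cue.begin + "|" + cue.end + "|" + cue.text.trim();
    }

    /** Keeps the first occurrence of each key and preserves the original order. */
    static List<Cue> deduplicate(final List<Cue> cues) {
        final Set<String> processedKeys = new HashSet<>();
        final List<Cue> result = new ArrayList<>();
        for (final Cue cue : cues) {
            // Set.add() returns false when the key was already present
            if (processedKeys.add(buildDeduplicationKey(cue))) {
                result.add(cue);
            }
        }
        return result;
    }
}
```

Keeping the first occurrence and preserving order means the output stays valid for sequential SRT numbering.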
…eExtractor to NewPipe repository.
- Change `package` and `import` statements to adapt to the NewPipe main repository.
- Replace `javax.annotation.Nonnull` with `androidx.annotation.NonNull` for compatibility with `androidx`.
…Tube-related URLs.
- `SubtitleDeduplicator` relies on YouTube-specific subtitle URL semantics (videoId, languageCode, translationCode) for cache file naming and deduplication.
- Add `isYoutubeRelatedUrl()` to ensure the deduplication logic is only applied to YouTube URLs. For non-YouTube subtitle URLs, the original subtitle URL is returned unchanged.
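A minimal sketch of such a guard, assuming the check is done on the URL host. The host list and the `resolveSubtitleUrl()` helper are assumptions for illustration; the PR's actual `isYoutubeRelatedUrl()` may use different criteria.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

final class YoutubeUrlSketch {
    /** Returns true only for hosts assumed here to belong to YouTube. */
    static boolean isYoutubeRelatedUrl(final String url) {
        if (url == null) {
            return false;
        }
        String host;
        try {
            host = new URI(url).getHost();
        } catch (final URISyntaxException e) {
            return false; // unparsable URLs are treated as non-YouTube
        }
        if (host == null) {
            return false;
        }
        host = host.toLowerCase(Locale.ROOT);
        // Assumed host list; matching on ".youtube.com" avoids accepting "notyoutube.com"
        return host.equals("youtube.com")
                || host.endsWith(".youtube.com")
                || host.equals("youtu.be");
    }

    /** Hypothetical caller: non-YouTube URLs pass through unchanged. */
    static String resolveSubtitleUrl(final String url) {
        if (!isYoutubeRelatedUrl(url)) {
            return url;
        }
        // Placeholder for the deduplication path; a real implementation
        // would return the location of the deduplicated cache file here.
        return url;
    }
}
```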
…ry selection.
- Introduce `CacheDirUtils` to centralize the application's cache directory selection logic.
- The preferred cache directory path is now initialized in `App.onCreate()` and passed to `SubtitleDeduplicator`, instead of relying on `StateSaver.init()`.
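The "recreate if missing" behaviour can be sketched with plain `java.io.File`, independent of Android. This is only an assumption about how it might look: the class and method names are hypothetical, and the `subtitle_cache` directory name is taken from the PR description.

```java
import java.io.File;
import java.io.IOException;

final class CacheDirSketch {
    // Directory name as mentioned in the PR description
    static final String SUBTITLE_CACHE_DIR_NAME = "subtitle_cache";

    /**
     * Returns the subtitle cache directory under the given app cache directory,
     * creating it (again) when it does not exist.
     */
    static File ensureSubtitleCacheDir(final File appCacheDir) throws IOException {
        final File dir = new File(appCacheDir, SUBTITLE_CACHE_DIR_NAME);
        // mkdirs() returns false if the directory already exists or was created
        // concurrently, so the final isDirectory() check decides success.
        if (!dir.isDirectory() && !dir.mkdirs() && !dir.isDirectory()) {
            throw new IOException("Could not create cache directory: " + dir);
        }
        return dir;
    }
}
```

Calling this before every cache write keeps the code working when the system or the user has cleared the cache while the app is running.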
… NewPipeExtractor to app side. Add unit tests for `SubtitleDeduplicator` in `SubtitleDeduplicatorTest.java`.
…on-cache` errors after moving from NewPipeExtractor to NewPipe repository.
- "error: static import only from classes and interfaces"
- Changed `package` and `import` statements to adapt to NewPipe main repository.
- Name 'containsDuplicatedEntries_exactDuplicate_shouldReturnTrue' must match pattern '^[a-z][a-zA-Z0-9]*$'.
- Variable 'expected' should be declared final.
- '+' should be on a new line.
- Line is longer than 100 characters
…ubtitles in app layer.
Introduce a new domain class `AppStreamInfo` to perform subtitle normalization (deduplication) on the application side without modifying the extractor data `StreamInfo`.
Previously, the subtitle deduplication logic modified `StreamInfo` directly, which mixed application concerns with extractor data structures. This change separates responsibilities by projecting the extractor's `StreamInfo` into an app-level domain object.
Key points:
- Preserve the original `StreamInfo` from the extractor unchanged.
- Perform subtitle deduplication once when constructing `AppStreamInfo`.
- Provide a normalized subtitle list for player and download usage.
- Ensure the subtitle normalization logic is centralized and reusable.
- `from()` only performs object creation.
- Subtitle deduplication (which requires network downloading) happens in `loadNormalizedSubtitles()`.
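The split between cheap construction and deferred normalization could look roughly like the sketch below. Only the names `from()` and `loadNormalizedSubtitles()` come from the commit message. `ExtractorInfo` is a hypothetical stand-in for the extractor's `StreamInfo`, subtitles are modelled as plain URL strings, and the `normalizer` parameter is an assumption so that the example runs without network access.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.UnaryOperator;

final class AppStreamInfoSketch {
    /** Hypothetical stand-in for the extractor's StreamInfo; only subtitle URLs are modelled. */
    static final class ExtractorInfo {
        private final List<String> subtitleUrls;

        ExtractorInfo(final List<String> subtitleUrls) {
            this.subtitleUrls = Collections.unmodifiableList(new ArrayList<>(subtitleUrls));
        }

        List<String> getSubtitles() {
            return subtitleUrls;
        }
    }

    private final ExtractorInfo original;

    private AppStreamInfoSketch(final ExtractorInfo original) {
        this.original = original;
    }

    /** Object creation only: no network access and no normalization happen here. */
    static AppStreamInfoSketch from(final ExtractorInfo info) {
        return new AppStreamInfoSketch(info);
    }

    /** The extractor data stays reachable and untouched. */
    ExtractorInfo getOriginal() {
        return original;
    }

    /**
     * Builds a new list instead of mutating the extractor data.
     * The normalizer (hypothetical) maps a remote subtitle URL to a deduplicated one.
     */
    List<String> loadNormalizedSubtitles(final UnaryOperator<String> normalizer) {
        final List<String> normalized = new ArrayList<>();
        for (final String url : original.getSubtitles()) {
            normalized.add(normalizer.apply(url));
        }
        return normalized;
    }
}
```

Because the expensive step is a separate method, a caller can decide on which thread it runs, which matters on Android where network access on the main thread is not allowed.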
…reen display (with ExoPlayer module).
- Replace `StreamInfo.getSubtitles()` with `AppStreamInfo.loadNormalizedSubtitles()` to download TTML subtitles and deduplicate them.
- Note: each module that calls subtitle normalization performs the network download independently. Unlike `StreamInfo`, `AppStreamInfo` cannot be shared across modules.
…ownloads
- After remote subtitles (TTML format) are downloaded, the subtitle content is processed by `SubtitleDeduplicator` to remove duplicated segments. The cleaned content is then passed to `SrtFromTtmlWriter` to generate the final SRT subtitle file without duplicated entries.
- This logic is platform-independent and does not distinguish whether the video source is YouTube or another platform.
- A minimal `ByteArraySharpStream` implementation is used to adapt the deduplicated byte content back into the `SharpStream` interface without modifying existing stream APIs.
- Add a comment explaining why `> 0` is used when reading `SharpStream`.
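The read-then-rewrap flow can be sketched with standard JDK streams. Here `java.io.InputStream` and `ByteArrayInputStream` stand in for `SharpStream` and `ByteArraySharpStream`; that substitution is an assumption, and the real `SharpStream` API is not reproduced. The commit message does not restate why `> 0` is used, so the loop only shows what the condition does: it stops on either a zero or a negative return value.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class StreamAdapterSketch {
    /** Reads the whole stream into memory so the content can be deduplicated. */
    static byte[] readFully(final InputStream in) throws IOException {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        final byte[] buffer = new byte[8192];
        int read;
        // "> 0" ends the loop on 0 as well as on -1, so a stream that reports
        // "no more data" with 0 cannot cause an endless loop.
        while ((read = in.read(buffer)) > 0) {
            out.write(buffer, 0, read);
        }
        return out.toByteArray();
    }

    /** Wraps processed bytes in a stream again (stand-in for ByteArraySharpStream). */
    static InputStream wrap(final byte[] content) {
        return new ByteArrayInputStream(content);
    }
}
```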
…tyle tags and normalize subtitle text content
- Helps handle YouTube subtitles that have different style attributes but the same text and timestamps
- Add `SUPPORT_STYLED_SUBTITLE_RENDERING` flag for future styled subtitle support (currently not supported in NewPipe)
- Remove invisible Unicode characters (zero-width and directionality controls)
- Handle non-breaking spaces, BOM (U+FEFF), multiple spaces, and leading/trailing spaces
- This commit is tested with: https://www.youtube.com/watch?v=7w3jBGX7UcY
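The normalization steps listed above can be sketched as a small text filter. The exact character ranges and the tag-stripping regex are assumptions chosen for illustration; the PR's real implementation may cover a different set.

```java
import java.util.regex.Pattern;

final class SubtitleTextNormalizerSketch {
    // Assumed set: zero-width characters, directionality controls, word joiner, BOM (U+FEFF)
    private static final Pattern INVISIBLE = Pattern.compile(
            "[\\u200B-\\u200F\\u202A-\\u202E\\u2060\\u2066-\\u2069\\uFEFF]");
    // Simplified: treats anything between '<' and '>' as a style tag
    private static final Pattern TAGS = Pattern.compile("<[^>]+>");
    private static final Pattern MULTIPLE_SPACES = Pattern.compile(" {2,}");

    /** Returns text suitable for comparing two cues regardless of styling. */
    static String normalize(final String text) {
        String result = TAGS.matcher(text).replaceAll("");
        result = INVISIBLE.matcher(result).replaceAll("");
        result = result.replace('\u00A0', ' '); // non-breaking space to regular space
        result = MULTIPLE_SPACES.matcher(result).replaceAll(" ");
        return result.trim();
    }
}
```

With this, two cues such as `<span style="s1">Hello world</span>` and `<span style="s2">Hello world</span>` normalize to the same string, so a key built from timestamps plus normalized text treats them as duplicates.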
…without changing logic
- Rename methods:
  - `getSubtitleKeyOfTtml()` → `buildDeduplicationKey()`
  - `storeItToCacheDir()` → `writeContentToCacheFile()`
- Rename variables:
  - `subtitleContent` → `ttmlFileContent`
  - `seen` → `processedKeys`
  - `subCacheDir` → `SUBTITLE_DEDUP_CACHE_DIR`
- Improve `deduplicateContent()` by clarifying variable names and adding comments
- Add comments to explain the logic
(No functional changes)
…ttern once for efficiency
- Replace the `defineTtmlSubtitlePattern()` method with a static final `Pattern`
- `getTtmlMatcher()` now reuses the precompiled pattern instead of recompiling on each call
- Improves performance when processing multiple subtitles without changing behavior
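A minimal sketch of the precompiled-pattern approach follows. The regex is a simplified assumption about the shape of a TTML `<p>` element (with `begin` before `end`) and is not the pattern used in the PR; only the `getTtmlMatcher()` name comes from the commit message.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TtmlPatternSketch {
    // Compiled once when the class is loaded; Pattern instances are immutable
    // and safe to share, while each Matcher is created per input.
    private static final Pattern TTML_PARAGRAPH = Pattern.compile(
            "<p\\s+begin=\"([^\"]+)\"\\s+end=\"([^\"]+)\"[^>]*>(.*?)</p>",
            Pattern.DOTALL);

    /** Group 1 is the begin time, group 2 the end time, group 3 the cue body. */
    static Matcher getTtmlMatcher(final String ttmlFileContent) {
        return TTML_PARAGRAPH.matcher(ttmlFileContent);
    }
}
```

Usage is a plain `while (matcher.find())` loop over the file content; no recompilation happens however many subtitle files are processed.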
Force-pushed from 4b86db0 to a87a8da
TransZAllen (Contributor, Author):
Hi, just a quick update. I have pushed a new set of changes (13 commits) for this PR. For context, most of the earlier discussion and review comments are in the extractor-side PR. I will continue posting detailed updates there.
What is it?
refactor branch
Description of the changes in your PR
This PR is a supporting change.
`cache/subtitle_cache` directory;
Relies on the following changes
Here are the main changes:
Before/After Screenshots/Screen Record
This PR was tested only by manually downloading a '*.srt' file, using the YouTube video: https://www.youtube.com/watch?v=b7vmW_5HSpE
The SRT subtitle file is downloaded as 'Violent Vincent_ Land Mines _ OFFICIAL MUSIC VIDEO-en-GB.srt'.
Before:
After:
Fixes the following issue(s)
Due diligence