Repeated DNA Sequences

解題說明

Repeated DNA Sequences

題目描述：給定一個 DNA 序列字串 s（僅含 'A', 'C', 'G', 'T'），找出所有長度為 10 且出現超過一次的子字串。

解題思路：使用雜湊表（HashSet）搭配滑動視窗：

用一個 seen 集合記錄已出現過的長度為 10 的子字串。
用一個 result 集合記錄重複出現的子字串（避免重複加入結果）。
從左到右滑動視窗，每次取長度 10 的子字串：
- 若已在 seen 中，加入 result。
- 否則加入 seen。
回傳 result 中的所有子字串。

使用 unordered_set<string> 即可高效解決。進階可用位元編碼（每個字母用 2 bit）將子字串壓縮成整數，減少記憶體與比較開銷。

1. If string length <= 10, return empty list 2. Initialize two sets: seen (visited substrings), repeated (duplicates) 3. For i from 0 to len(s) - 10: a. Extract substring sub = s[i..i+9] (length 10) b. If sub in seen: add to repeated c. Else: add to seen 4. Return all strings in repeated

其他解法

位元編碼 Rolling Hash：將 A=0, C=1, G=2, T=3 各用 2 bit 表示，10 個字母用 20 bit 整數表示。滑動視窗時，移除最高位、左移、加入新字母，O(1) 更新雜湊值。用 unordered_set<int> 替代字串集合，減少記憶體使用約 5 倍。時間 O(n)，空間更優。

Rabin-Karp Rolling Hash：使用多項式雜湊函數做滑動視窗，遇到碰撞時再比較原字串。概念上與位元編碼類似，但可用於更長的子字串（超過 32 bit 能表示的範圍）。此題子字串固定 10 個字母，位元編碼更簡潔。

延伸思考

若子字串長度不是固定的 10 而是任意 k，算法需要如何調整？
位元編碼方式中，若字母表大小超過 4（如蛋白質序列有 20 種氨基酸），每個字母需要多少 bit？整體方法是否仍適用？
如何找出出現次數最多的長度為 10 的子字串？

解題說明