A detailed resolution plan is now available.
The b2xtranslator fails to properly extract list formatting during text conversion, resulting in plain text output that lacks bullet points, numbering, and proper indentation. This significantly impacts document readability and structure preservation.
- Loss of document structure and organization
- Missing visual hierarchy from lists
- Inability to distinguish between regular paragraphs and list items
IMPORTANT: Treat this plan as a guide. It may not be fully accurate, so review it critically.
The foundation of fixing list extraction is to correctly parse the data structures that define lists in the Word binary format. This involves the List Table (LST) and the List Format Override Table (LFO).
- Fully parse `LSTF` and `LVL` structures: Ensure that `Doc/DocFileFormat/Data/ListTable.cs` and related classes can completely parse the `LSTF` (List Format) and `LVL` (Level) structures from the document's Table Stream. Key fields are:
  - `lsid`: The unique ID for the list.
  - `tplc`: A template code that defines the list's basic properties.
  - `rgistd`: An array that maps paragraph styles to list levels.
  - In the `LVL` structure: `nfc` (number format code), `ixch` (character index for bullets), and indentation properties (`dxaLeft`, `dxaIndent`).
- Parse `PAPX` for list properties: In `Doc/DocFileFormat/Structures/ParagraphProperties.cs`, the `PAPX` structure contains two crucial fields that link a paragraph to a list:
  - `ilfo`: The index into the `LFO` table. This identifies which list override applies.
  - `ilvl`: The indentation level of the paragraph within the list (0-8).

  These fields must be reliably extracted for every paragraph.
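The extraction of `ilfo` and `ilvl` can be sketched as a scan over a paragraph's property modifiers (sprms). This is a minimal illustration, not the translator's actual API: the `Sprm` struct and `ListPropertyReader` class are hypothetical stand-ins, and the opcode values and the "0 means no list" convention are taken from my reading of the [MS-DOC] specification and should be verified against it.

```csharp
using System.Collections.Generic;

// Hypothetical, simplified stand-in for one decoded property modifier (sprm).
public readonly struct Sprm
{
    public ushort OpCode { get; }
    public byte[] Operand { get; }

    public Sprm(ushort opCode, byte[] operand)
    {
        OpCode = opCode;
        Operand = operand;
    }
}

public static class ListPropertyReader
{
    // Assumed opcodes (per [MS-DOC]); confirm before relying on them.
    private const ushort SprmPIlvl = 0x260A; // 1-byte operand: list level (0-8)
    private const ushort SprmPIlfo = 0x460B; // 2-byte operand: 1-based LFO index, 0 = no list

    // Returns (ilfo, ilvl). Later sprms override earlier ones.
    // An ilfo of 0 means the paragraph is treated as a non-list paragraph.
    public static (int Ilfo, int Ilvl) Read(IEnumerable<Sprm> sprms)
    {
        int ilfo = 0;
        int ilvl = 0;
        foreach (var sprm in sprms)
        {
            if (sprm.OpCode == SprmPIlvl && sprm.Operand.Length >= 1)
            {
                ilvl = sprm.Operand[0];
            }
            else if (sprm.OpCode == SprmPIlfo && sprm.Operand.Length >= 2)
            {
                // Operands are stored little-endian in the file.
                ilfo = (short)(sprm.Operand[0] | (sprm.Operand[1] << 8));
            }
        }
        return (ilfo, ilvl);
    }
}
```

Returning a neutral `(0, 0)` when neither sprm is present keeps non-list paragraphs on the existing code path.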
Once the data is parsed, it needs to be stored in an accessible way in the in-memory document model.
- Create list information classes: Introduce new classes to represent the parsed list data.

      // In a new file, e.g., Doc/DocFileFormat/ListInfo.cs
      public class ListLevelInfo
      {
          public int NumberFormatCode { get; set; }      // The nfc code
          public char BulletCharacter { get; set; }      // The actual character to use
          public string NumberFormatString { get; set; } // e.g., "%1."
          public int IndentDxa { get; set; }
      }

      public class ListInfo
      {
          public int ListId { get; set; }
          public List<ListLevelInfo> Levels { get; set; } = new List<ListLevelInfo>();
      }
- Store parsed lists in `WordDocument`: In `WordDocument.cs`, add a property to store all the lists defined in the document.

      // In Doc/DocFileFormat/WordDocument.cs
      public Dictionary<int, ListInfo> AllLists { get; private set; }

  This dictionary will be populated by the `ListTable` parser, mapping a `ListId` to its full definition.

- Link paragraphs to lists: Add a property to the paragraph representation to hold its specific list formatting.

      // In the class representing a parsed paragraph
      public class Paragraph
      {
          // ... other properties
          public int ListId { get; set; } = -1; // Default to no list
          public int ListLevel { get; set; } = -1;
      }

  When parsing the `PAPX` for each paragraph, populate these two fields: `ListLevel` comes directly from `ilvl`, while `ListId` has to be resolved by looking up `ilfo` in the `LFO` table, because `AllLists` is keyed by list ID rather than by `LFO` index.
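Because `ilfo` indexes the `LFO` table while `AllLists` is keyed by list ID, a small resolution step sits between the two. The sketch below shows one way to do it. `ListResolver` and its parameters are hypothetical names, and it assumes that `ilfo` is 1-based (with 0 meaning "no list") and that each `LFO` entry exposes the `lsid` of the list it overrides.

```csharp
using System.Collections.Generic;

public static class ListResolver
{
    // lfoListIds[i] is assumed to hold the lsid of the i-th LFO entry, in file order.
    // Returns the list ID for a paragraph, or null when the paragraph is not
    // part of a list or points at a list that was never parsed.
    public static int? ResolveListId(
        int ilfo,
        IReadOnlyList<int> lfoListIds,
        ICollection<int> knownListIds)
    {
        // 0 (or any out-of-range value) is treated as "not a list item".
        if (ilfo < 1 || ilfo > lfoListIds.Count)
        {
            return null;
        }

        int lsid = lfoListIds[ilfo - 1]; // convert the 1-based index to 0-based
        return knownListIds.Contains(lsid) ? lsid : (int?)null;
    }
}
```

A caller could pass `AllLists.Keys` as `knownListIds` and store `-1` in `Paragraph.ListId` whenever the result is `null`, which keeps malformed documents from throwing during text output.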
With the list information now available in the model, the TextMapping can be updated to generate the correct output.
- Create a list state manager: In `Text/TextMapping/TextMapping.cs`, create a mechanism to track the current number for each active list.

      // In TextMapping.cs
      private Dictionary<int, int> _listCounters = new Dictionary<int, int>();
- Update paragraph mapping logic: In the method that processes paragraphs (likely in `TextMapping.cs` or a `ParagraphMapping.cs`), modify the logic to be list-aware.

      // In the paragraph processing method
      protected override void HandleParagraph(Paragraph p)
      {
          if (p.ListId != -1)
          {
              // This is a list item
              var list = _wordDocument.AllLists[p.ListId];
              var level = list.Levels[p.ListLevel];

              // 1. Calculate indentation (two spaces per level)
              var indent = new string(' ', p.ListLevel * 2);
              _writer.Write(indent);

              // 2. Get bullet or number
              string bullet;
              if (level.NumberFormatCode == 23) // nfc 23 (0x17): simple bullet
              {
                  bullet = level.BulletCharacter + " ";
              }
              else // Numbered list
              {
                  if (!_listCounters.ContainsKey(p.ListId))
                  {
                      _listCounters[p.ListId] = 0;
                  }
                  _listCounters[p.ListId]++;
                  // "%1." is not a .NET composite format string, so string.Format
                  // would emit it unchanged; substitute the placeholder directly.
                  bullet = level.NumberFormatString.Replace("%1", _listCounters[p.ListId].ToString()) + " ";
              }
              _writer.Write(bullet);
          }
          else
          {
              // Not a list item, so reset counters for any list that might have just ended.
              // (This logic needs to be robust to handle list restarts)
              _listCounters.Clear();
          }

          // 3. Write the actual paragraph text
          _writer.WriteLine(p.Text);
      }
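A single counter per list does not number nested levels independently. One possible refinement is to keep a counter per (list, level) pair and restart the deeper levels whenever a shallower item appears. The sketch below is an assumption-laden illustration rather than part of the plan: `ListNumberingState` is a hypothetical class, numbering is assumed to start at 1 (any start value stored in the file is ignored), and the template is assumed to use `%N` for the number of level N, following the `"%1."` example above.

```csharp
using System;
using System.Collections.Generic;

public sealed class ListNumberingState
{
    private const int MaxLevels = 9; // ilvl ranges over 0-8

    // One counter array per list ID, indexed by level.
    private readonly Dictionary<int, int[]> _counters = new Dictionary<int, int[]>();

    // Advances the counter for (listId, level) and returns the new value.
    public int Next(int listId, int level)
    {
        if (level < 0 || level >= MaxLevels)
        {
            throw new ArgumentOutOfRangeException(nameof(level));
        }

        if (!_counters.TryGetValue(listId, out var counters))
        {
            counters = new int[MaxLevels];
            _counters[listId] = counters;
        }

        counters[level]++;

        // A new item at this level restarts numbering on all deeper levels.
        for (int i = level + 1; i < MaxLevels; i++)
        {
            counters[i] = 0;
        }

        return counters[level];
    }

    // Builds the text prefix for one item: indentation, number, trailing space.
    // Only the placeholder of the item's own level is substituted.
    public static string FormatPrefix(string template, int level, int number)
    {
        string placeholder = "%" + (level + 1);
        return new string(' ', level * 2)
            + template.Replace(placeholder, number.ToString())
            + " ";
    }
}
```

With this state, the sequence of levels 0, 0, 1, 0, 1 in the same list yields the numbers 1, 2, 1, 3, 1, and `FormatPrefix("%1.", 0, 2)` produces `"2. "`. Whether counters should also be cleared when a non-list paragraph intervenes depends on how Word continues lists across such paragraphs, which this sketch leaves open.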
Files involved:
- `Text/TextMapping/TextMapping.cs`
- `Doc/DocFileFormat/Data/ListTable.cs`
- `Doc/DocFileFormat/Structures/ParagraphProperties.cs` (`PAPX` parsing)
- `Doc/DocFileFormat/Data/ListFormatOverride.cs` (`LFO` parsing)
Expected results:
- Bullet Preservation: Simple bulleted lists are prefixed with a bullet character (e.g., `•`).
- Number Preservation: Simple numbered lists are prefixed with the correct, incrementing number (e.g., `1.`, `2.`).
- Indentation: Nested list items are indented with leading spaces.
- No Regressions: Paragraphs that are not part of a list are rendered correctly without any extra formatting.