Implementation Documentation for Agentic LLM Workflow: macOS ScreenMate (SwiftUI First - Direct VLM, In-Memory Screenshot, Custom Prompts)
Develop a native macOS application ("ScreenMate") that:
- Runs as a menubar accessory application (no Dock icon).
- Provides advanced image understanding functionality triggered by a screenshot, capturing the image into memory (as an NSImage) and processing it using a locally loaded Vision Language Model (VLM) via MLX Swift, with an option for users to provide custom prompts. (OCR is one of its capabilities.)
- Features a main interface in a menubar popover panel.
- Features a "Custom Prompt" floating panel allowing users to input their own VLM prompts for image processing.
- Allows configuration for auto-starting at login and selecting a VLM model from a predefined list.
- Uses SwiftUI for UI components where feasible, and AppKit for system integrations and panel management.
- Language: Swift.
- UI Framework: SwiftUI first. Use AppKit for system integration.
- ML Framework: MLX Swift (including MLXLLM, MLXLMCommon, Tokenizers, Hub) for direct VLM loading and inference.
- Screenshot Handling: Capture screenshots directly into memory as NSImage.
- Bridging (UI): NSHostingView for SwiftUI in AppKit panels.
- State Management (SwiftUI): Standard SwiftUI state management. ScreenMateEngine and AppSettings will be ObservableObject.
- Modularity: Distinct classes/structs.
- Error Handling: Robust error handling for VLM operations, UI feedback.
- Asynchronous Operations: async/await, Task.detached for VLM. UI updates on main thread.
- File Structure: Adhere to suggested structure (UI, SystemIntegration, Shared subfolders).
Component ID: C000 (New)
Name: App Settings (Shared/AppSettings.swift)
- Purpose: An ObservableObject to store shared application settings, like the selected VLM model identifier and potentially the last custom prompt.
- Key Technologies: Swift, Combine (ObservableObject, @Published).
- Core Responsibilities/Tasks for LLM:
  - Create AppSettings.swift (in a Shared group).
  - Define a class AppSettings conforming to ObservableObject.
  - Add @Published var selectedVLMModelIdentifier: String initialized with a default from ScreenMateEngine.supportedVLMModels (C006).
  - Add @Published var lastCustomPrompt: String = "".
- Inputs/Dependencies: ScreenMateEngine.supportedVLMModels (C006) for default.
- Outputs/Deliverables: An AppSettings class.
- Interface with Other Components: Injected as an @StateObject or @EnvironmentObject into MenubarContentView (C004), SettingsView (C008), and CustomPromptView (C012).
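The C000 tasks above could be sketched roughly as follows. This is a minimal illustration, not the final implementation: ModelCatalog is a hypothetical stand-in for ScreenMateEngine.supportedVLMModels (C006) so the snippet compiles on its own, and picking the alphabetically first display name as the default is an assumption, since the spec does not say which entry is the default.

```swift
import Combine

// Hypothetical stand-in for ScreenMateEngine.supportedVLMModels (C006).
// The single entry is copied from the spec's example and has not been verified.
enum ModelCatalog {
    static let supportedVLMModels: [String: String] = [
        "Llava Phi-3 Mini": "mlx-community/llava-phi-3-mini-128k-instruct-4bit"
    ]
}

final class AppSettings: ObservableObject {
    // Hub identifier or local path of the model the user selected.
    @Published var selectedVLMModelIdentifier: String
    // Last prompt typed into the Custom Prompt panel.
    @Published var lastCustomPrompt: String = ""

    init() {
        // Dictionary order is unspecified, so sort the display names to get
        // a deterministic default (assumption: alphabetically first entry).
        let firstName = ModelCatalog.supportedVLMModels.keys.sorted().first
        selectedVLMModelIdentifier =
            firstName.flatMap { ModelCatalog.supportedVLMModels[$0] } ?? ""
    }
}
```

In the real project the initializer would read from ScreenMateEngine.supportedVLMModels directly instead of the stand-in.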
Component ID: C001
Name: Application Core Setup (AppDelegate.swift)
- Purpose: Initialize the application, set it as an accessory app, and manage its lifecycle. Instantiate core managers.
- Key Technologies: Swift, AppKit (NSApplicationDelegate, NSApp).
- Core Responsibilities/Tasks for LLM:
  - Create/Update AppDelegate.swift.
  - Ensure conformance: NSObject, NSApplicationDelegate.
  - Implement applicationDidFinishLaunching(_:):
    - Call NSApp.setActivationPolicy(.accessory).
    - Instantiate MenuBarManager (C002) and retain it.
  - Add/Modify method showCustomPromptPanel() (renamed from showSpotlight) to instantiate and show CustomPromptPanelController (C011).
  - Implement applicationWillTerminate(_:).
  - Implement applicationShouldTerminateAfterLastWindowClosed(_:) returning false.
  - Ensure @main in ScreenMateApp.swift uses @NSApplicationDelegateAdaptor(AppDelegate.self).
- Inputs/Dependencies: None initially. Will create MenuBarManager.
- Outputs/Deliverables: A functioning AppDelegate.swift.
- Interface with Other Components: Instantiates MenuBarManager. Provides showCustomPromptPanel().
- Success Criteria/Verification: App launches as accessory. MenuBarManager initialized. showCustomPromptPanel() is callable.
- Reference File(s): ScreenMate/ScreenMate/AppDelegate.swift.
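As a rough sketch of the C001 responsibilities, the delegate might look like the following. MenuBarManager is stubbed here purely so the snippet is self-contained (the real class is C002), and the body of showCustomPromptPanel() is left as a placeholder because the panel wiring belongs to C011.

```swift
import AppKit

// Hypothetical stub so the sketch compiles on its own; the real class is C002.
final class MenuBarManager: NSObject {}

final class AppDelegate: NSObject, NSApplicationDelegate {
    // Held strongly so the status item outlives applicationDidFinishLaunching.
    private(set) var menuBarManager: MenuBarManager?

    func applicationDidFinishLaunching(_ notification: Notification) {
        // Accessory apps do not appear in the Dock and have no menu bar.
        NSApp.setActivationPolicy(.accessory)
        menuBarManager = MenuBarManager()
    }

    func applicationWillTerminate(_ notification: Notification) {
        // Release the manager on shutdown.
        menuBarManager = nil
    }

    func applicationShouldTerminateAfterLastWindowClosed(_ sender: NSApplication) -> Bool {
        // A menubar app must keep running when its panels are closed.
        false
    }

    func showCustomPromptPanel() {
        // Placeholder: C011 describes creating and showing
        // CustomPromptPanelController from here.
    }
}
```

In ScreenMateApp.swift, the @NSApplicationDelegateAdaptor(AppDelegate.self) property named in the task list is what would attach this delegate to the SwiftUI app lifecycle.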
Component ID: C002
Name: MenuBar Management (MenuBarManager.swift)
- Purpose: Create and manage the system menubar icon and its actions.
- Key Technologies: Swift, AppKit (NSStatusItem, NSStatusBar, NSImage, NSObject).
- Core Responsibilities/Tasks for LLM: (As per original plan, ensure image name is "TrayIcon" or a new generic icon for ScreenMate). Default panel dimensions might be around width: 340, height: 550 to accommodate more UI.
- Inputs/Dependencies: PanelController (C003), "TrayIcon" in Assets.xcassets.
- Outputs/Deliverables: MenuBarManager.swift.
- Success Criteria: Menubar icon appears and toggles panel.
- Reference File(s): ScreenMate/ScreenMateCore/MenuBarManager.swift.
Component ID: C003
Name: Menubar Panel Controller (PanelController.swift)
- Purpose: Manage the NSPanel for the menubar popover, hosting SwiftUI content.
- Key Technologies: Swift, AppKit (NSWindowController, NSPanel, NSWindowDelegate), SwiftUI (NSHostingView).
- Core Responsibilities/Tasks for LLM (Updates):
  6. Implement private func embedSwiftUIView(in panel: NSPanel):
    - Instantiate AppSettings (C000) and AutostartManager (C007).
    - Instantiate MenubarContentView (C004), injecting both appSettings and autostartManager as .environmentObject().
- Inputs/Dependencies: MenubarContentView (C004), AppSettings (C000), AutostartManager (C007).
- Outputs/Deliverables: PanelController.swift.
- Success Criteria: Panel works, hosts MenubarContentView with injected environment objects.
- Reference File(s): ScreenMate/ScreenMateCore/PanelController.swift.
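The embedding step for C003 might be sketched like this. AppSettings, AutostartManager, and MenubarContentView are reduced to hypothetical stubs so the snippet stands alone, and the isEnabled property and the convenience initializer are illustrative assumptions rather than part of the spec.

```swift
import AppKit
import SwiftUI

// Hypothetical stubs; the real types are C000, C007, and C004.
final class AppSettings: ObservableObject {
    @Published var selectedVLMModelIdentifier = ""
}
final class AutostartManager: ObservableObject {
    @Published var isEnabled = false  // assumed property name
}
struct MenubarContentView: View {
    @EnvironmentObject var appSettings: AppSettings
    @EnvironmentObject var autostartManager: AutostartManager
    var body: some View { Text(appSettings.selectedVLMModelIdentifier) }
}

final class PanelController: NSWindowController {
    // Owned by the controller so they live as long as the panel does.
    private let appSettings = AppSettings()
    private let autostartManager = AutostartManager()

    // Assumed initializer shape, for illustration only.
    convenience init(panel: NSPanel) {
        self.init(window: panel)
        embedSwiftUIView(in: panel)
    }

    private func embedSwiftUIView(in panel: NSPanel) {
        // Inject both objects so child views can use @EnvironmentObject.
        let root = MenubarContentView()
            .environmentObject(appSettings)
            .environmentObject(autostartManager)
        // NSHostingView bridges the SwiftUI hierarchy into the AppKit panel.
        panel.contentView = NSHostingView(rootView: root)
    }
}
```

Keeping the two objects as stored properties of the controller, rather than locals inside embedSwiftUIView, is one way to give them a clear owner; sharing them with other panels would need a different arrangement.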
Component ID: C004
Name: Menubar Content View (UI/MenubarContentView.swift)
- Purpose: Main UI. Interacts with ScreenshotManager and ScreenMateEngine. Manages VLM model loading based on AppSettings. No Copy button.
- Key Technologies: Swift, SwiftUI.
- Core Responsibilities/Tasks for LLM:
  - Create MenubarContentView.swift in UI/.
  - State Management:
    - @EnvironmentObject var appSettings: AppSettings (C000).
    - @StateObject private var screenshotManager = ScreenshotManager() (C005).
    - @StateObject private var screenMateEngine = ScreenMateEngine() (C006 - Renamed from OCREngine).
    - @State private var processedTextResult: String = "Select a VLM model in Settings and click Load.".
    - @State private var showingSettings = false.
    - @State private var lastScreenshotPreviewImage: Image?.
  - Body Layout (VStack):
    - Display screenMateEngine.currentStatusMessage, screenMateEngine.loadedModelNameDisplay.
    - "Load/Change VLM Model" Button (or TextField for ID + Load Button). Action: Task { await screenMateEngine.loadModel(modelIdentifier: appSettings.selectedVLMModelIdentifier) }. Disable based on screenMateEngine.isLoadingModel.
    - "Process Screenshot" Button (renamed from "Take Screenshot & OCR"). Action: processScreenshotWithDefaultPrompt(). Disable appropriately.
    - ScrollView for processedTextResult (selectable, monospaced).
    - Optional Image for lastScreenshotPreviewImage.
    - HStack with:
      - (Copy Button Removed)
      - "Settings" Button (with .popover for SettingsView C008).
      - "Custom Prompt" Button. Action: (NSApp.delegate as? AppDelegate)?.showCustomPromptPanel().
  - .onChange(of: appSettings.selectedVLMModelIdentifier): If selection changes, call Task { await screenMateEngine.loadModel(modelIdentifier: appSettings.selectedVLMModelIdentifier) }.
  - Private Methods:
    - processScreenshotWithDefaultPrompt(): (was takeScreenshotAndOCR)
      - Guard model loaded/engine state.
      - Call screenshotManager.takeScreenshotToImage(...).
      - On NSImage received, update preview.
      - Call screenMateEngine.processImage(onNSImage: receivedNSImage, prompt: screenMateEngine.getDefaultOCRPrompt(), ...) (processImage is the renamed performOCR, see C006).
      - Handle Result, update processedTextResult.
- Inputs/Dependencies: AppSettings (C000), ScreenshotManager (C005), ScreenMateEngine (C006), SettingsView (C008), NotificationManager (C009), AppDelegate (C001).
- Outputs/Deliverables: MenubarContentView.swift.
- Success Criteria: Model loading via appSettings. Default image processing works. Custom prompt panel invoked.
- Reference File(s): ScreenMate/ScreenMateCore/UI/MenubarContentView.swift.
Component ID: C005
Name: Screenshot Manager (In-Memory) (ScreenshotManager.swift)
- (No functional changes from the "In-Memory Screenshot" plan.)
- Reference File(s): ScreenMate/ScreenMateCore/ScreenshotManager.swift.
Component ID: C006
Name: ScreenMate Engine (Direct VLM with Custom Prompt Support) (ScreenMateEngine.swift) (Renamed from OCREngine)
- Purpose: Load, manage, run VLM for image processing (including OCR) on NSImage, supporting custom user prompts.
- Key Technologies: Swift, MLX, MLXLLM, MLXLMCommon, Tokenizers, Hub, AppKit.
- Core Responsibilities/Tasks for LLM:
  - Rename file and class from OCREngine to ScreenMateEngine.
  - (Structure, Published Properties, Error Enum, init, loadModel as per previous "Direct VLM & In-Memory" plan for C006, using ScreenMateEngineError).
  - Add static let supportedVLMModels: [String: String] property: A dictionary of ["Display Name": "hub_or_path_identifier"] (e.g., ["Llava Phi-3 Mini": "mlx-community/llava-phi-3-mini-128k-instruct-4bit", "Custom Moondream": "/path/to/moondream"]). Initialize with at least one valid VLM.
  - Add getDefaultOCRPrompt() -> String method: Returns a default prompt specifically tuned for OCR (e.g., "Extract all text from this image...").
  - Modify performOCR method signature (or rename to a more generic processImage): func processImage(onNSImage nsImage: NSImage, prompt: String, completion: @escaping (Result<String, ScreenMateEngineError>) -> Void)
  - Inside processImage(...):
    - Use the provided prompt when constructing UserInput messages. The prompt should already include the VLM-specific image placeholder (e.g., <image>\nUser's custom prompt here). The ScreenMateEngine might offer a helper to prepend this placeholder if the user prompt is raw text.
    - (Image Handling for UserInput and inference logic remains similar, focusing on in-memory NSImage to UserInput.Image conversion. Agent must prioritize in-memory UserInput.Image creation from NSImage data, using temporary files only as a last resort if MLX libraries are restrictive).
- Inputs/Dependencies: NSImage, VLM model identifier, custom text prompt string. MLX Swift packages.
- Outputs/Deliverables: ScreenMateEngine.swift.
- Success Criteria: loadModel works. processImage uses the custom prompt.
- Reference File(s): Detailed OCREngine.swift example, renamed to ScreenMateEngine.swift and adapted for custom prompts.
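The prompt-related additions to C006 can be illustrated without any MLX code. The sketch below assumes the <image>\n placeholder from the spec's example; the actual placeholder depends on the chosen VLM. ScreenMateEnginePrompts, imagePlaceholder, and withImagePlaceholder(_:) are hypothetical names, the default prompt wording is only a placeholder for the text the spec elides, and the model identifier is copied from the spec's example without verification.

```swift
import Foundation

// Hypothetical namespace standing in for the static parts of ScreenMateEngine.
enum ScreenMateEnginePrompts {
    // Display name -> Hub identifier or local path (entry from the spec's example).
    static let supportedVLMModels: [String: String] = [
        "Llava Phi-3 Mini": "mlx-community/llava-phi-3-mini-128k-instruct-4bit"
    ]

    // Assumed placeholder token; the real one is model-specific.
    static let imagePlaceholder = "<image>\n"

    // Default OCR prompt; the exact wording here is illustrative.
    static func getDefaultOCRPrompt() -> String {
        imagePlaceholder + "Extract all text from this image."
    }

    // Prepends the placeholder only when the user's text does not already start with it.
    static func withImagePlaceholder(_ userPrompt: String) -> String {
        let trimmed = userPrompt.trimmingCharacters(in: .whitespacesAndNewlines)
        return trimmed.hasPrefix("<image>") ? trimmed : imagePlaceholder + trimmed
    }
}
```

Centralizing the placeholder in the engine, as the spec suggests with its optional helper, would keep both the default-prompt path (C004) and the custom-prompt path (C012) from hard-coding a token that changes with the model.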
Component ID: C007
Name: Autostart Manager (SystemIntegration/AutostartManager.swift)
- Core Responsibilities/Tasks for LLM (Updates):
  - Ensure appBundleIdentifier and appName in init() are correctly derived from Bundle.main for "ScreenMate" or set to the new app's values.
- Reference File(s): ScreenMate/ScreenMateCore/SystemIntegration/AutostartManager.swift.
Component ID: C008
Name: Settings View (UI/SettingsView.swift)
- Purpose: UI for settings, including VLM model selection from a predefined list.
- Key Technologies: Swift, SwiftUI.
- Core Responsibilities/Tasks for LLM:
  - Create SettingsView.swift.
  - @EnvironmentObject var appSettings: AppSettings (C000).
  - @EnvironmentObject var autostartManager: AutostartManager (C007).
  - Body Layout (VStack):
    - Title, Toggle for autostart.
    - VLM Model Selection Section:
      - Text("Select VLM Model:").
      - Picker("VLM Model", selection: $appSettings.selectedVLMModelIdentifier):
        - Iterate over ScreenMateEngine.supportedVLMModels.keys.sorted().
        - For each key (display name), use ScreenMateEngine.supportedVLMModels[key]! as the tag (identifier string).
    - App version display.
- Inputs/Dependencies: AppSettings (C000), AutostartManager (C007), ScreenMateEngine.supportedVLMModels (C006).
- Outputs/Deliverables: SettingsView.swift with VLM model picker.
- Success Criteria: Autostart toggle. VLM model selection updates appSettings.selectedVLMModelIdentifier.
- Reference File(s): ScreenMate/ScreenMateCore/UI/SettingsView.swift.
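The model picker portion of C008 might look roughly like this. ModelCatalog and the reduced AppSettings are hypothetical stand-ins so the snippet is self-contained, and ModelPickerSection is an illustrative name for just this section of SettingsView. The sketch also falls back to the display name with ?? instead of force-unwrapping the lookup, which is a deviation from the task list.

```swift
import SwiftUI

// Hypothetical stand-in for ScreenMateEngine.supportedVLMModels (C006).
enum ModelCatalog {
    static let supportedVLMModels: [String: String] = [
        "Llava Phi-3 Mini": "mlx-community/llava-phi-3-mini-128k-instruct-4bit"
    ]
}

// Reduced stand-in for AppSettings (C000).
final class AppSettings: ObservableObject {
    @Published var selectedVLMModelIdentifier = ""
}

struct ModelPickerSection: View {
    @EnvironmentObject var appSettings: AppSettings

    var body: some View {
        VStack(alignment: .leading) {
            Text("Select VLM Model:")
            Picker("VLM Model", selection: $appSettings.selectedVLMModelIdentifier) {
                // Sort the display names so the menu order is stable.
                ForEach(ModelCatalog.supportedVLMModels.keys.sorted(), id: \.self) { name in
                    // The tag is the identifier string, matching the String selection type.
                    Text(name).tag(ModelCatalog.supportedVLMModels[name] ?? name)
                }
            }
        }
    }
}
```

Because the tag type must match the selection binding's type, tagging each row with the identifier string (rather than the display name) is what lets a selection write directly into selectedVLMModelIdentifier.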
Component ID: C009
Name: Notification Manager (SystemIntegration/NotificationManager.swift)
- Reference File(s): ScreenMate/ScreenMateCore/SystemIntegration/NotificationManager.swift.
Component ID: C010
Name: Custom Prompt Panel Appearance (CustomPromptPanel.swift) (Renamed from SpotlightPanel)
- Purpose: Define the custom NSPanel for the floating "Custom Prompt" window.
- Key Technologies: Swift, AppKit (NSPanel).
- Core Responsibilities/Tasks for LLM: (Implement as per original C010 plan, renaming file and class to CustomPromptPanel).
- Reference File(s): Rename SpotlightPanel.swift to CustomPromptPanel.swift.
Component ID: C011
Name: Custom Prompt Panel Controller (CustomPromptPanelController.swift) (Renamed)
- Purpose: Manage the CustomPromptPanel window and host its SwiftUI content.
- Key Technologies: Swift, AppKit (NSWindowController), SwiftUI (NSHostingView).
- Core Responsibilities/Tasks for LLM:
  - Create/Rename to CustomPromptPanelController.swift.
  - Subclass NSWindowController.
  - convenience init(customPromptPanel: CustomPromptPanel).
  - In AppDelegate.showCustomPromptPanel() (C001):
    - Obtain AppSettings (C000) and ScreenMateEngine (C006) instances, retrieving shared instances if they are singletons or globally managed rather than creating new ones. Because ScreenMateEngine is already a @StateObject in MenubarContentView (C004), it must either be shared or have the necessary data passed along. One option is for CustomPromptView to take AppSettings, create its own ScreenshotManager, and call a global/shared ScreenMateEngine instance or a method that uses it. For now, assume the main ScreenMateEngine instance is reachable here.
    - Create CustomPromptView (C012). Inject appSettings and screenMateEngine as .environmentObject().
    - Host in NSHostingView, set as customPromptPanel.contentView.
    - Instantiate CustomPromptPanelController(customPromptPanel: panel).
- Inputs/Dependencies: CustomPromptPanel (C010), CustomPromptView (C012), AppSettings (C000), ScreenMateEngine (C006).
- Reference File(s): Rename SpotlightPanelController.swift to CustomPromptPanelController.swift.
Component ID: C012
Name: Custom Prompt Content View (UI/CustomPromptView.swift) (Renamed from SpotlightContentView)
- Purpose: SwiftUI interface for user to input a custom VLM prompt and process a new screenshot.
- Key Technologies: Swift, SwiftUI.
- Core Responsibilities/Tasks for LLM:
  - Create CustomPromptView.swift in UI/.
  - State Management:
    - @EnvironmentObject var appSettings: AppSettings (C000).
    - @EnvironmentObject var screenMateEngine: ScreenMateEngine (C006).
    - @StateObject private var screenshotManager = ScreenshotManager() (C005).
    - @State private var userPromptText: String = "". (Initialize with appSettings.lastCustomPrompt in .onAppear).
    - @State private var customProcessingInProgress: Bool = false.
    - @State private var customProcessingResultText: String = "".
    - @State private var screenshotForCustomPromptPreview: Image?.
  - Body Layout (VStack):
    - Text("Custom VLM Prompt").
    - TextEditor(text: $userPromptText) for multi-line input. Min height, resizable.
    - Button("Take Screenshot & Process with This Prompt"). Action: processScreenshotWithCustomPrompt(). Disable if screenMateEngine.modelContainer == nil, screenMateEngine.isLoadingModel, or customProcessingInProgress.
    - Optional Image view for screenshotForCustomPromptPreview.
    - ScrollView to display customProcessingResultText.
  - Private Methods:
    - processScreenshotWithCustomPrompt():
      - Set customProcessingInProgress = true. Update appSettings.lastCustomPrompt = userPromptText.
      - Ensure userPromptText includes the VLM-specific image placeholder (e.g., <image>\n). The view could prepend this if userPromptText is just the raw question.
      - Call screenshotManager.takeScreenshotToImage(...).
      - On NSImage received, update screenshotForCustomPromptPreview.
      - Call screenMateEngine.processImage(onNSImage: receivedNSImage, prompt: userPromptTextWithPlaceholder, ...)
      - Handle Result, update customProcessingResultText. Set customProcessingInProgress = false.
- Inputs/Dependencies: AppSettings (C000), ScreenMateEngine (C006), ScreenshotManager (C005).
- Outputs/Deliverables: CustomPromptView.swift.
- Interface with Other Components: Hosted by CustomPromptPanelController.
- Success Criteria: User inputs prompt, triggers screenshot, VLM processes with custom prompt, results displayed.
- Reference File(s): SpotlightContentView.swift to be heavily adapted into CustomPromptView.swift.
Component ID: C013 (Optional - Low Priority)
Name: Workspace Monitor (SystemIntegration/WorkspaceMonitor.swift)
- (No change to its own spec, but its utility might be higher with custom, context-aware prompts).
- Reference File(s): ScreenMate/ScreenMateCore/SystemIntegration/WorkspaceMonitor.swift.
Component ID: C014
Name: Project Setup & Configuration
- Core Responsibilities/Tasks for LLM (Updates):
  - Rename existing project files/targets from "OCRToolbox" to "ScreenMate" (careful, manual steps often needed here first).
  - Update Bundle Identifier to reflect "ScreenMate" (e.g., com.yourcompany.ScreenMate).
  - (MLX Dependencies remain essential).
- Reference File(s): Project build settings, Info.plist.
- Project Renaming (Manual/Guided First): Rename .xcodeproj, schemes, targets to "ScreenMate".
- C014 (Project Setup): Verify Bundle ID for "ScreenMate". Ensure MLX dependencies are linked.
- C000 (AppSettings): Create shared settings.
- C001 (AppDelegate): Update for showCustomPromptPanel().
- C005 (ScreenshotManager - In-Memory).
- C009 (NotificationManager), C007 (AutostartManager - Update for new Bundle ID/App Name).
- C006 (ScreenMateEngine - Renamed from OCREngine): Implement supportedVLMModels, getDefaultOCRPrompt(). Adapt processImage (was performOCR) to take prompt: String.
- C008 (SettingsView): Implement with VLM model Picker using ScreenMateEngine.supportedVLMModels and binding to appSettings.selectedVLMModelIdentifier.
- C004 (MenubarContentView):
  - Inject AppSettings. Rename ocrEngine to screenMateEngine.
  - Implement UI for loading model based on appSettings.selectedVLMModelIdentifier.
  - "Process Screenshot" button calls screenMateEngine.processImage with screenMateEngine.getDefaultOCRPrompt().
  - Button to invoke AppDelegate.showCustomPromptPanel(). Remove Copy button.
- C010 (CustomPromptPanel - Renamed), C012 (CustomPromptView - Renamed), C011 (CustomPromptPanelController - Renamed): Implement the custom prompt UI and its interactions.
- C003 (PanelController): Ensure AppSettings and AutostartManager are correctly injected into MenubarContentView.
- C002 (MenuBarManager).
- Core functionality fully testable: VLM selection, default processing, custom prompt processing.
- Refinements & Testing.
- Renaming: Be meticulous with renaming "OCRToolbox" to "ScreenMate" and "OCREngine" to "ScreenMateEngine" throughout the codebase, including filenames, class names, variable names, comments, and log messages.
- AppSettings (C000): This is a new central piece for settings.
- ScreenMateEngine (C006):
  - Add supportedVLMModels static property.
  - processImage (renamed from performOCR) must accept a prompt: String.
- MenubarContentView (C004): Remove Copy button. Drive model loading via appSettings.
- SettingsView (C008): Implement Picker for VLM selection.
- Custom Prompt Feature (C010, C011, C012): This is a significant UI and logic addition. CustomPromptView will need to manage its own screenshot and prompt, then call screenMateEngine.processImage. Ensure the VLM-specific image placeholder (e.g. <image>\n) is correctly prepended to the user's custom text prompt before sending to ScreenMateEngine.