Parse LLM actions as Lybic actions.
Lybic provides the ability to parse LLM output text in specific formats, allowing your LLM Agent to directly convert the output into Lybic actions during the grounding phase and hand them over to Lybic for execution.
Overview
Using the POST /api/computer-use/parse/{type} method, you can parse the output of different localization models (such as ui-tars, seed, GLM-4.1v, GLM-4.5v, qwen-2.5-vl, etc.) into actions for desktop or mobile devices.
This document provides detailed explanations of the prompt settings and input examples for different models.
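As a rough illustration of the request shape, the Python sketch below builds (but does not send) a parse request using only the standard library. The endpoint path and the `textContent` field come from the cURL examples in this document; the helper name `build_parse_request` and the `BASE_URL` constant are illustrative assumptions, and any authentication headers your deployment may require are not shown.

```python
import json
import urllib.request

# Base URL taken from the cURL examples in this document.
BASE_URL = "https://api.lybic.cn"


def build_parse_request(model_type: str, llm_output: str) -> urllib.request.Request:
    """Build a POST request for /api/computer-use/parse/{type} (hypothetical helper)."""
    # json.dumps takes care of escaping quotes and newlines in the LLM output.
    body = json.dumps({"textContent": llm_output}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/computer-use/parse/{model_type}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_parse_request("ui-tars", "Action: left_double(point='<point>213 257</point>')")
print(req.full_url)
print(req.get_method())
```

Sending the request is then a matter of passing `req` to `urllib.request.urlopen`. Letting `json.dumps` produce the body avoids hand-escaping quotes inside the payload.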
Computer action parsing
Supported models
- ui-tars: Relative coordinate system, coordinate range 0-1000
- seed: Absolute coordinate system, used in models such as doubao-1.6-seed and openCUA.
- glm-4.1v: Absolute coordinate system, coordinate range 0-1000
- glm-4.5v: Absolute coordinate system
- qwen-2.5-vl: Absolute coordinate system, original pixel coordinates
- pyautogui: Python pyautogui formatting actions
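To make the 0-1000 coordinate range concrete, the sketch below maps a point given in that range onto a screen of a given pixel size. It only illustrates the arithmetic: the function name and the rounding choice are assumptions, and it does not describe how Lybic converts coordinates internally.

```python
from typing import Tuple


def scale_from_1000(x: int, y: int, width: int, height: int) -> Tuple[int, int]:
    """Map a point in a 0-1000 range onto a width x height pixel screen."""
    if not (0 <= x <= 1000 and 0 <= y <= 1000):
        raise ValueError("coordinates must be within 0-1000")
    # Rounding to the nearest pixel is an assumption of this sketch.
    return round(x * width / 1000), round(y * height / 1000)


print(scale_from_1000(213, 257, 1280, 720))  # (273, 185)
```

For a model that already emits original pixel coordinates, no scaling step of this kind is needed.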
1. ui-tars models
Prompt template
You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
## Output Format
Thought: ...
Action: ...
## Action Space
click(point='<point>x1 y1</point>')
left_double(point='<point>x1 y1</point>')
right_single(point='<point>x1 y1</point>')
drag(start_point='<point>x1 y1</point>', end_point='<point>x2 y2</point>')
hotkey(key='ctrl c') # Split keys with a space and use lowercase. Also, do not use more than 3 keys in one hotkey action.
type(content='xxx') # Use escape characters \', \", and \n in content part to ensure we can parse the content in normal python string format. If you want to submit your input, use \n at the end of content.
scroll(point='<point>x1 y1</point>', direction='down or up or right or left') # Show more information on the `direction` side.
wait() #Sleep for 5s and take a screenshot to check for any changes.
finished(content='xxx') # Use escape characters \', \", and \n in content part to ensure we can parse the content in normal python string format.
## Note
- Use {language} in `Thought` part.
- Write a small plan and finally summarize your next action (with its target element) in one sentence in `Thought` part.
## User Instruction
{instruction}
Input example
LLM output:
Thought: The task requires double-left-clicking the "images" folder. In the File Explorer window, the "images" folder is visible under the Desktop directory. The target element is the folder named "images" with a yellow folder icon. Double-left-clicking this folder will open it.
Next action: Left - double - click on the "images" folder icon located in the File Explorer window, under the Desktop directory, with the name "images" and yellow folder icon.
Action: left_double(point='<point>213 257</point>')
cURL call:
curl -X POST "https://api.lybic.cn/api/computer-use/parse/ui-tars" \
-H "Content-Type: application/json" \
-d '{
"textContent": "Thought: The task requires double-left-clicking the \"images\" folder. In the File Explorer window, the \"images\" folder is visible under the Desktop directory. The target element is the folder named \"images\" with a yellow folder icon. Double-left-clicking this folder will open it.\n\nNext action: Left - double - click on the \"images\" folder icon located in the File Explorer window, under the Desktop directory, with the name \"images\" and yellow folder icon.\nAction: left_double(point='\''<point>213 257</point>'\'')"
}'
2. Seed Model (Doubao and other absolute coordinate system models)
Prompt Template
Same as ui-tars, but the coordinates are represented as absolute coordinates:
You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
## Output Format
Thought: ...
Action: ...
## Action Space
click(point='<point>x1 y1</point>')
left_double(point='<point>x1 y1</point>')
right_single(point='<point>x1 y1</point>')
drag(start_point='<point>x1 y1</point>', end_point='<point>x2 y2</point>')
hotkey(key='ctrl c')
type(content='xxx')
scroll(point='<point>x1 y1</point>', direction='down or up or right or left')
wait()
finished(content='xxx')
## Note
- Use {language} in `Thought` part.
- Coordinates should be in 0-1000 range (absolute coordinates).
## User Instruction
{instruction}
Input example
LLM output:
Thought: I need to double-click on the images folder at position [213, 257].
Action: left_double(point='<point>213 257</point>')
cURL call:
curl -X POST "https://api.lybic.cn/api/computer-use/parse/seed" \
-H "Content-Type: application/json" \
-d '{
"textContent": "Thought: I need to double-click on the images folder at position [213, 257].\nAction: left_double(point='\''<point>213 257</point>'\'')"
}'
3. GLM-4.1v model
Prompt template
You are a GUI operation agent. You will be given a task and your action history, with recent screenshots. You should help me control the computer, output the best action step by step to accomplish the task.
The actions you output must be in the following action space:
left_click(start_box='[x,y]', element_info='')
# left single click at [x,y]
right_click(start_box='[x,y]', element_info='')
# right single click at [x,y]
middle_click(start_box='[x,y]', element_info='')
# middle single click at [x,y]
hover(start_box='[x,y]', element_info='')
# hover the mouse at [x,y]
left_double_click(start_box='[x,y]', element_info='')
# left double click at [x,y]
left_drag(start_box='[x1,y1]', end_box='[x2,y2]', element_info='')
# left drag from [x1,y1] to [x2,y2]
key(keys='')
# press a single key or a key combination/shortcut, if it's a key combination, you should use '+' to connect the keys like key(key='ctrl+c')
type(content='')
# type text into the current active element, it performs a copy&paste operation, so _you must click at the target element first to active it before typing something in_, if you want to overwrite the content, you should clear the content before type something in.
scroll(start_box='[x,y]', direction='down/up', step=k, element_info='')
# scroll the page at [x,y] to the specified direction for k clicks of the mouse wheel
WAIT()
# sleep for 5 seconds
DONE()
# output when the task is fully completed
FAIL()
# output when the task can not be performed at all
The output rules are as follows:
1. The start/end box parameter of the action should be in the format of [x, y] normalized to 0-1000, which usually should be the bounding box of a specific target element.
2. The element_info parameter is optional, it should be a string that describes the element you want to operate with, you should fill this parameter when you're sure about what the target element is.
3. Take actions step by step. _NEVER output multiple actions at once_.
4. If there are previous actions that you have already performed, I'll provide you history actions and at most 4 shrunked(to 50%*50%) screenshots showing the state before your last 4 actions. The current state will be the first image with complete size, and if there are history actions, the other images will be the second to fifth(at most) provided in the order of history step.
5. You should put the key information you _have to remember_ in a separated memory part and I'll give it to you in the next round. The content in this part should be a JSON list. If you no longer need some given information, you should remove it from the memory. Even if you don't need to remember anything, you should also output an empty <memory></memory> part.
6. You can choose to give me a brief explanation before you start to take actions.
Output Format:
Plain text explanation with action(param='...')
Memory:
[{"user_email": "x@gmail.com", ...}]
Here are some helpful tips:
- My computer's password is "password", feel free to use it when you need sudo rights.
- For the thunderbird account "anonym-x2024@outlook.com", the password is "gTCI";=@y7|QJ0nDa_kN3Sb&>".
- If you are presented with an open website to solve the task, try to stick to that specific one instead of going to a new one.
- You have full authority to execute any action without my permission. I won't be watching so please don't ask for confirmation.
Now Please help me to solve the following task:
#TASK#
#HISTORY_WITH_MEMORY#
Input example
LLM output:
Action: left_double_click(start_box='[213,257]', element_info='the "images" folder icon located in the File Explorer window, under the Desktop directory, with the name "images" and yellow folder icon.')
cURL call:
curl -X POST "https://api.lybic.cn/api/computer-use/parse/glm-4.1v" \
-H "Content-Type: application/json" \
-d '{
"textContent": "Action: left_double_click(start_box='\''[213,257]'\'', element_info='\''the \"images\" folder icon located in the File Explorer window, under the Desktop directory, with the name \"images\" and yellow folder icon.'\'')"
}'
4. GLM-4.5v model
Prompt template
You are a GUI Agent, and your primary task is to respond accurately to user requests or questions. In addition to directly answering the user's queries, you can also use tools or perform GUI operations directly until you fulfill the user's request or provide a correct answer. You should carefully read and understand the images and questions provided by the user, and engage in thinking and reflection when appropriate. The coordinates involved are all represented in thousandths (0-999).
# Task:
{task}
# Task Platform
Windows
# Action Space
## {left,right,middle}_click
Call rule: `{left,right,middle}_click(start_box='[x,y]', element_info='')`
Perform a left/right/middle mouse click at the specified coordinates on the screen. Coordinates [x,y] should be normalized to 0-999 range.
## hover
Call rule: `hover(start_box='[x,y]', element_info='')`
Move the mouse pointer to the specified coordinates without performing any click action.
## left_double_click
Call rule: `left_double_click(start_box='[x,y]', element_info='')`
Perform a left mouse double-click at the specified coordinates on the screen.
## left_drag
Call rule: `left_drag(start_box='[x1,y1]', end_box='[x2,y2]', element_info='')`
Drag the mouse from starting coordinates to ending coordinates while holding the left mouse button.
## key
Call rule: `key(keys='')`
Simulate pressing a single key or combination of keys on the keyboard. Use '+' to separate keys in combinations (e.g., 'ctrl+c', 'alt+tab').
## type
Call rule: `type(content='')`
Type text content into the currently focused text input field.
## scroll
Call rule: `scroll(start_box='[x,y]', direction='', step=5, element_info='')`
Scroll an element at the specified coordinates in the specified direction by a given number of wheel steps. Direction can be 'down' or 'up'.
## WAIT
Call rule: `WAIT()`
Wait for 5 seconds before proceeding to the next action.
## DONE
Call rule: `DONE()`
Indicate that the current task has been completed successfully.
## FAIL
Call rule: `FAIL()`
Indicate that the current task cannot be completed.
# Output Format
Plain text explanation with action(param='...')
Memory:
[{"key": "value"}, ...]
# Additional Notes
- I'll give you the most recent 4 history screenshots(shrunked to 50%*50%) along with the historical action steps.
- You should put the key information you _have to remember_ in a separated memory part.
- My computer's password is "password", feel free to use it when you need sudo rights.
Current Screenshot:
Input example
LLM output:
Action: left_double_click(start_box='[213,257]', element_info='the "images" folder icon located in the File Explorer window, under the Desktop directory, with the name "images" and yellow folder icon.')
cURL call:
curl -X POST "https://api.lybic.cn/api/computer-use/parse/glm-4.5v" \
-H "Content-Type: application/json" \
-d '{
"textContent": "Action: left_double_click(start_box='\''[213,257]'\'', element_info='\''the \"images\" folder icon located in the File Explorer window, under the Desktop directory, with the name \"images\" and yellow folder icon.'\'')"
}'
5. Qwen-2.5-vl model
Prompt template
# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name_for_human": "computer_use", "name": "computer_use", "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\n* The screen's resolution is 1280x720.\n* Whenever you intend to move the cursor to click on an element like an icon, you should consult a screenshot to determine the coordinates of the element before moving the cursor.\n* If you tried clicking on a program or link but it failed to load, even after waiting, try adjusting your cursor position so that the tip of the cursor visually falls on the element that you want to click.\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. 
The available actions are:\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\n* `type`: Type a string of text on the keyboard.\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\n* `left_click`: Click the left mouse button.\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\n* `right_click`: Click the right mouse button.\n* `middle_click`: Click the middle mouse button.\n* `double_click`: Double-click the left mouse button.\n* `scroll`: Performs a scroll of the mouse scroll wheel.\n* `wait`: Wait specified seconds for the change to happen.\n* `terminate`: Terminate the current task and report its completion status.", "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "scroll", "wait", "terminate"], "type": "string"}, "keys": {"description": "Required only by `action=key`.", "type": "array"}, "text": {"description": "Required only by `action=type`.", "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.", "type": "array"}, "pixels": {"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll`.", "type": "number"}, "time": {"description": "The seconds to wait. Required only by `action=wait`.", "type": "number"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}, "args_format": "Format the arguments as a JSON object."}}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
Input example
LLM output:
<tool_call>
{"name": "computer_use", "arguments": {"action": "double_click", "coordinate": [213, 257]}}
</tool_call>
cURL call:
curl -X POST "https://api.lybic.cn/api/computer-use/parse/qwen-2.5-vl" \
-H "Content-Type: application/json" \
-d '{
"textContent": "<tool_call>{\"name\": \"computer_use\", \"arguments\": {\"action\": \"double_click\", \"coordinate\": [213, 257]}}</tool_call>"
}'
6. Pyautogui format
Introduction
If you are using a different positioning model, a model not listed above, or an output format that conforms to the rules of the built-in pyautogui action processing engine, you can set model_type to pyautogui.
Pyautogui action parsing supports standard pyautogui code blocks:
pyautogui.click(x=200, y=200)
pyautogui.moveTo(100, 150)
pyautogui.write("Hello from Lybic!")
pyautogui.press("enter")
pyautogui.doubleClick(x=213, y=257)
pyautogui.drag(100, 100, duration=0.5)
pyautogui.scroll(-5)
Input example
LLM Output:
Text
```python
pyautogui.doubleClick(213, 257)
```
cURL call:
curl -X POST "https://api.lybic.cn/api/computer-use/parse/pyautogui" \
-H "Content-Type: application/json" \
-d '{
"textContent": "Text\n```python\npyautogui.doubleClick(213, 257)\n```"
}'
Mobile device action parsing
Supported models
Mobile action parsing supports the same models as the desktop version: ui-tars, seed, glm-4.1v, glm-4.5v, qwen-2.5-vl, etc. However, we recommend using ui-tars and its prompts.
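The mobile endpoint used in this document differs from the desktop one only in its path segment (`mobile-use` instead of `computer-use`). A minimal sketch of selecting the URL follows, assuming the type names listed for desktop are accepted unchanged on mobile. The helper name `parse_url` and the `SUPPORTED_TYPES` set are illustrative and not part of any Lybic SDK.

```python
# Type names as listed under "Supported models"; treating them as valid
# for both platforms is an assumption of this sketch.
SUPPORTED_TYPES = {"ui-tars", "seed", "glm-4.1v", "glm-4.5v", "qwen-2.5-vl", "pyautogui"}


def parse_url(platform: str, model_type: str) -> str:
    """Return the parse endpoint for 'computer' or 'mobile' (hypothetical helper)."""
    if platform not in ("computer", "mobile"):
        raise ValueError(f"unknown platform: {platform}")
    if model_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported model type: {model_type}")
    return f"https://api.lybic.cn/api/{platform}-use/parse/{model_type}"


print(parse_url("mobile", "ui-tars"))
```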
ui-tars model prompt template
The UI-TARS mobile prompt template is similar to the desktop version, with the main difference being the action space:
You are a Android GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
you should follow the following format to output your thought and action:
## Output Format:
Summary: ...
Action: ...
Here is the action space:
## Action Space
click(start_box='<bbox>x1 y1 x2 y2</bbox>') # x1, y1, x2, y2 equals left, top, right, bottom
drag(start_box='<bbox>x1 y1 x2 y2</bbox>', end_box='<bbox>x3 y3 x4 y4</bbox>') # x1, y1, x2, y2 equals left, top, right, bottom, x3, y3, x4, y4 equals left, top, right, bottom. If you want to slide down, y2 < y1, because the phone is operated in the opposite direction, slide up, and vice versa
type(content='') # If you want to type, must check keyboard is open, or you must click the input field to open keyboard.
press_home() # Go to the home page
press_back() # Go back to the previous page
list_apps() # List all installed apps, return a list of app_name and package_name.
wait(t='t') # Sleep for t seconds number, wait for change, t is lower than 10, higher than 0.
finished(content='') # If the task is completed, call this action. You must summary the task result in content.
call_user(content='') # When the task is unsolvable or you need the user's help like login, input verification code or need more information, call the user. You must exactly describe the user request in content.
## Note
- Use {language} in 'Summary' part.
- Google Chrome is not available due to technical limits. If you need a browser, use "Via".
- Allow 应用商店 to install applications. 允许来自此来源的应用, no need to call_user.
- Write a small plan and finally summarize your next action (with its target element) in one sentence in 'Summary' part.
- If the task need user to log in ,input verification code, or need more information, use call_user Action to guide them to do so.
- Don't enter the verification code for users on your own. Use "call_user" to notify users instead.
- If user asks you to install or download a certain app, you should understand it as both downloading and installing.
- If the user wants to use some app, please use list_apps to check whether the current app exists. If it does not exist, use the 应用商店 to download it first.
- If you click the screen but nothing happens, you can try swiping the screen to see if the screen changes. Some pages have list views, and you need to scroll the list view to see if the screen has changed.
- If you want to drag(swipe) the screen, but the screen hasn't changed after drag (swipe) operations, you may need to try swiping at different positions or directions.
- If the screen hasn't changed after drag (swipe) operations:
- Check if you've already swiped in the opposite direction
- If you have swiped in both directions and the screen still hasn't changed, it means you've reached both ends
- In this case, you need to:
- Either try a different approach (e.g., using buttons or menus)
- Or call_user to ask for user guidance
- If the screen has changed, continue with the next action
- If the same operation is performed on the same GUI interface more than three times, check whether the system is stuck in an infinite loop. If so, stop the current execution and attempt to generate a new plan to achieve the goal. This may involve adjusting the operation sequence, switching navigation paths, skipping the current step, or triggering an exception handling mechanism. And triger call_user tool if necessary.
- Maintain a history of recent actions and screen states. Use this context to detect loops and improve future decision-making. If a sequence of operations and corresponding GUI states repeats over a period (e.g., the same 3-step interface and action pattern reoccurs), check whether the system is stuck in an infinite loop due to misaligned interaction logic (e.g., "long press" interpreted as "click").
When such a loop is detected:
- Stop the current execution.
- Analyze the repeating operation and screen sequence to identify potential mismatches between expected and actual behavior.
- Generate a new plan to achieve the original goal, which may include:
- Changing the interaction method (e.g., try swipe/delete from settings instead of long press).
- Navigating through alternative UI paths.
- Triggering system-level menus or shortcuts.
- Skipping the current step and retrying later.
- trigger call_user tool if necessary.
- Don't output your <bbox> xml in the Summary part, only in action part. Summary part is for user to understand your action.
- For sensitive operations like logout, deletion, or payment, always ask for user confirmation first.
You can quickly integrate this into your business using our playground and mini-agent (which has a similar architecture to the playground).
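The mobile action space above expresses targets as `<bbox>x1 y1 x2 y2</bbox>` boxes. If you need a single point from such an action, for example for logging or a sanity check before sending the text to Lybic, one option is to take the box center. The regular expression and function name below are assumptions made for this sketch, and whether Lybic itself taps the center of the box is not stated here.

```python
import re
from typing import Optional, Tuple

# Matches click(start_box='<bbox>x1 y1 x2 y2</bbox>') as written in the prompt template.
CLICK_RE = re.compile(
    r"click\(start_box='<bbox>(\d+) (\d+) (\d+) (\d+)</bbox>'\)"
)


def click_center(llm_output: str) -> Optional[Tuple[float, float]]:
    """Return the center of the first click bbox found, or None if there is none."""
    match = CLICK_RE.search(llm_output)
    if match is None:
        return None
    x1, y1, x2, y2 = (int(g) for g in match.groups())
    return (x1 + x2) / 2, (y1 + y2) / 2


sample = "Summary: Open settings.\nAction: click(start_box='<bbox>100 200 300 260</bbox>')"
print(click_center(sample))  # (200.0, 230.0)
```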
Instructions for use
API Call:
curl -X POST "https://api.lybic.cn/api/mobile-use/parse/pyautogui" \
-H "Content-Type: application/json" \
-d '{
"textContent": "Text\n```python\npyautogui.doubleClick(213, 257)\n```"
}'
Supported mobile device action space
from typing import Union
MobileUseAction = Union[
ScreenshotAction, # generalActionScreenshotSchema
WaitAction, # generalActionWaitSchema
FinishedAction, # generalActionFinishedSchema
FailedAction, # generalActionFailedSchema
ClientUserTakeoverAction, # generalActionUserTakeoverSchema
KeyboardTypeAction, # generalActionKeyboardTypeSchema
KeyboardHotkeyAction, # generalActionKeyboardHotkeySchema
TouchTapAction, # mobileUseActionTapSchema
TouchDragAction, # mobileUseActionDragSchema
TouchSwipeAction, # mobileUseActionSwipeSchema
TouchLongPressAction, # mobileUseActionLongPressSchema
AndroidBackAction, # mobileUseActionPressBackSchema
AndroidHomeAction, # mobileUseActionPressHomeSchema
OsStartAppAction, # mobileUseActionStartAppSchema
OsStartAppByNameAction, # mobileUseActionStartAppByNameSchema
OsCloseAppByNameAction, # mobileUseActionCloseAppByNameSchema
OsCloseAppAction, # mobileUseActionCloseAppSchema
OsListAppsAction, # mobileUseActionListAppsSchema
]
Usage Recommendations
- Choose the appropriate model: Select the corresponding model_type based on the positioning model you are using.
- Error handling: Parsing may fail; it is recommended to add error handling logic.
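As one way to follow the error-handling recommendation, the sketch below wraps a parse call so that HTTP errors, network failures, and non-JSON responses become a single result tuple instead of an uncaught exception. The `send` parameter is injected so the logic can be exercised without network access. The function names, and the assumption that a successful response body is JSON, are illustrative rather than documented behavior.

```python
import json
import urllib.error
import urllib.request
from typing import Any, Callable, Optional, Tuple


def http_send(url: str, body: bytes) -> bytes:
    """Default sender: POST the JSON body and return the raw response bytes."""
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()


def safe_parse(
    model_type: str,
    llm_output: str,
    send: Callable[[str, bytes], bytes] = http_send,
) -> Tuple[Optional[Any], Optional[str]]:
    """Return (parsed_response, None) on success or (None, error_message) on failure."""
    url = f"https://api.lybic.cn/api/computer-use/parse/{model_type}"
    body = json.dumps({"textContent": llm_output}).encode("utf-8")
    try:
        raw = send(url, body)
        return json.loads(raw), None
    except urllib.error.HTTPError as exc:
        # HTTPError is a subclass of URLError, so it must be caught first.
        return None, f"HTTP {exc.code}"
    except urllib.error.URLError as exc:
        return None, f"network error: {exc.reason}"
    except json.JSONDecodeError:
        return None, "response was not valid JSON"
```

A caller can then branch on the second element of the tuple, for example re-prompting the model when parsing fails rather than aborting the agent loop.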